lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2fb7d110fd9d210e12a61ebb28af6faf330d6421.camel@nvidia.com>
Date: Thu, 22 Aug 2024 01:53:22 +0000
From: Jianbo Liu <jianbol@...dia.com>
To: "jv@...sburgh.net" <jv@...sburgh.net>
CC: "liuhangbin@...il.com" <liuhangbin@...il.com>, "davem@...emloft.net"
	<davem@...emloft.net>, Leon Romanovsky <leonro@...dia.com>, Gal Pressman
	<gal@...dia.com>, "andy@...yhouse.net" <andy@...yhouse.net>, Tariq Toukan
	<tariqt@...dia.com>, "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"pabeni@...hat.com" <pabeni@...hat.com>, "edumazet@...gle.com"
	<edumazet@...gle.com>, Saeed Mahameed <saeedm@...dia.com>, "kuba@...nel.org"
	<kuba@...nel.org>
Subject: Re: [PATCH net V5 3/3] bonding: change ipsec_lock from spin lock to
 mutex

On Wed, 2024-08-21 at 09:00 -0700, Jay Vosburgh wrote:
> Jianbo Liu <jianbol@...dia.com> wrote:
> 
> > In the cited commit, bond->ipsec_lock is added to protect
> > ipsec_list,
> > hence xdo_dev_state_add and xdo_dev_state_delete are called inside
> > this lock. As ipsec_lock is a spin lock and such xfrmdev ops may
> > sleep,
> > "scheduling while atomic" will be triggered when changing bond's
> > active slave.
> > 
> > [  101.055189] BUG: scheduling while atomic: bash/902/0x00000200
> > [  101.055726] Modules linked in:
> > [  101.058211] CPU: 3 PID: 902 Comm: bash Not tainted 6.9.0-rc4+ #1
> > [  101.058760] Hardware name:
> > [  101.059434] Call Trace:
> > [  101.059436]  <TASK>
> > [  101.060873]  dump_stack_lvl+0x51/0x60
> > [  101.061275]  __schedule_bug+0x4e/0x60
> > [  101.061682]  __schedule+0x612/0x7c0
> > [  101.062078]  ? __mod_timer+0x25c/0x370
> > [  101.062486]  schedule+0x25/0xd0
> > [  101.062845]  schedule_timeout+0x77/0xf0
> > [  101.063265]  ? asm_common_interrupt+0x22/0x40
> > [  101.063724]  ? __bpf_trace_itimer_state+0x10/0x10
> > [  101.064215]  __wait_for_common+0x87/0x190
> > [  101.064648]  ? usleep_range_state+0x90/0x90
> > [  101.065091]  cmd_exec+0x437/0xb20 [mlx5_core]
> > [  101.065569]  mlx5_cmd_do+0x1e/0x40 [mlx5_core]
> > [  101.066051]  mlx5_cmd_exec+0x18/0x30 [mlx5_core]
> > [  101.066552]  mlx5_crypto_create_dek_key+0xea/0x120 [mlx5_core]
> > [  101.067163]  ? bonding_sysfs_store_option+0x4d/0x80 [bonding]
> > [  101.067738]  ? kmalloc_trace+0x4d/0x350
> > [  101.068156]  mlx5_ipsec_create_sa_ctx+0x33/0x100 [mlx5_core]
> > [  101.068747]  mlx5e_xfrm_add_state+0x47b/0xaa0 [mlx5_core]
> > [  101.069312]  bond_change_active_slave+0x392/0x900 [bonding]
> > [  101.069868]  bond_option_active_slave_set+0x1c2/0x240 [bonding]
> > [  101.070454]  __bond_opt_set+0xa6/0x430 [bonding]
> > [  101.070935]  __bond_opt_set_notify+0x2f/0x90 [bonding]
> > [  101.071453]  bond_opt_tryset_rtnl+0x72/0xb0 [bonding]
> > [  101.071965]  bonding_sysfs_store_option+0x4d/0x80 [bonding]
> > [  101.072567]  kernfs_fop_write_iter+0x10c/0x1a0
> > [  101.073033]  vfs_write+0x2d8/0x400
> > [  101.073416]  ? alloc_fd+0x48/0x180
> > [  101.073798]  ksys_write+0x5f/0xe0
> > [  101.074175]  do_syscall_64+0x52/0x110
> > [  101.074576]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
> > 
> > As bond_ipsec_add_sa_all and bond_ipsec_del_sa_all are only called
> > from bond_change_active_slave, which requires holding the RTNL
> > lock.
> > And bond_ipsec_add_sa and bond_ipsec_del_sa are xfrm state
> > xdo_dev_state_add and xdo_dev_state_delete APIs, which are in user
> > context. So ipsec_lock doesn't have to be spin lock, change it to
> > mutex, and thus the above issue can be resolved.
> > 
> > Fixes: 9a5605505d9c ("bonding: Add struct bond_ipesc to manage SA")
> > Signed-off-by: Jianbo Liu <jianbol@...dia.com>
> > Signed-off-by: Tariq Toukan <tariqt@...dia.com>
> > Reviewed-by: Hangbin Liu <liuhangbin@...il.com>
> > ---
> > drivers/net/bonding/bond_main.c | 67 +++++++++++++++---------------
> > ---
> > include/net/bonding.h           |  2 +-
> > 2 files changed, 32 insertions(+), 37 deletions(-)
> > 
> > diff --git a/drivers/net/bonding/bond_main.c
> > b/drivers/net/bonding/bond_main.c
> > index 0d1129eaf47b..f20f6d83ad54 100644
> > --- a/drivers/net/bonding/bond_main.c
> > +++ b/drivers/net/bonding/bond_main.c
> > @@ -439,38 +439,33 @@ static int bond_ipsec_add_sa(struct
> > xfrm_state *xs,
> >         rcu_read_lock();
> >         bond = netdev_priv(bond_dev);
> >         slave = rcu_dereference(bond->curr_active_slave);
> > -       if (!slave) {
> > -               rcu_read_unlock();
> > +       real_dev = slave ? slave->dev : NULL;
> > +       rcu_read_unlock();
> > +       if (!real_dev)
> >                 return -ENODEV;
> 
>         In reading these, I was confused as to why some changes use
> rcu_read_lock(), rcu_dereference() and others use rtnl_dereference();
> I
> think it's because bond_ipsec_{add,del}_sa_all() are guaranteed to be
> called under RTNL, while the bond_ipsec_{add,del}_sa() functions are
> do
> not have that guarantee.  Am I understanding correctly?
> 

Right. bond_ipsec_{add,del}_sa_all() are called by
bond_change_active_slave() which has ASSERT_RTNL(), so I think they are
under RTNL.

> > -       }
> > 
> > -       real_dev = slave->dev;
> >         if (!real_dev->xfrmdev_ops ||
> >             !real_dev->xfrmdev_ops->xdo_dev_state_add ||
> >             netif_is_bond_master(real_dev)) {
> >                 NL_SET_ERR_MSG_MOD(extack, "Slave does not support
> > ipsec offload");
> > -               rcu_read_unlock();
> >                 return -EINVAL;
> >         }
> > 
> > -       ipsec = kmalloc(sizeof(*ipsec), GFP_ATOMIC);
> > -       if (!ipsec) {
> > -               rcu_read_unlock();
> > +       ipsec = kmalloc(sizeof(*ipsec), GFP_KERNEL);
> > +       if (!ipsec)
> >                 return -ENOMEM;
> 
>         Presumably the switch from ATOMIC to KERNEL is safe because
> this
> is only called under RTNL (and therefore always has a process
> context),
> i.e., this change is independent of any other changes in the patch.
> Correct?
> 

No. And it's RCU here, not RTNL. We are safe to use KERNEL after it's
out of the RCU context, right? And this was suggested by Paolo after he
reviewd the first version.      

> > -       }
> > 
> >         xs->xso.real_dev = real_dev;
> >         err = real_dev->xfrmdev_ops->xdo_dev_state_add(xs, extack);
> >         if (!err) {
> >                 ipsec->xs = xs;
> >                 INIT_LIST_HEAD(&ipsec->list);
> > -               spin_lock_bh(&bond->ipsec_lock);
> > +               mutex_lock(&bond->ipsec_lock);
> >                 list_add(&ipsec->list, &bond->ipsec_list);
> > -               spin_unlock_bh(&bond->ipsec_lock);
> > +               mutex_unlock(&bond->ipsec_lock);
> >         } else {
> >                 kfree(ipsec);
> >         }
> > -       rcu_read_unlock();
> >         return err;
> > }
> > 
> > @@ -481,35 +476,35 @@ static void bond_ipsec_add_sa_all(struct
> > bonding *bond)
> >         struct bond_ipsec *ipsec;
> >         struct slave *slave;
> > 
> > -       rcu_read_lock();
> > -       slave = rcu_dereference(bond->curr_active_slave);
> > -       if (!slave)
> > -               goto out;
> > +       slave = rtnl_dereference(bond->curr_active_slave);
> > +       real_dev = slave ? slave->dev : NULL;
> > +       if (!real_dev)
> > +               return;
> > 
> > -       real_dev = slave->dev;
> > +       mutex_lock(&bond->ipsec_lock);
> >         if (!real_dev->xfrmdev_ops ||
> >             !real_dev->xfrmdev_ops->xdo_dev_state_add ||
> >             netif_is_bond_master(real_dev)) {
> > -               spin_lock_bh(&bond->ipsec_lock);
> >                 if (!list_empty(&bond->ipsec_list))
> >                         slave_warn(bond_dev, real_dev,
> >                                    "%s: no slave
> > xdo_dev_state_add\n",
> >                                    __func__);
> > -               spin_unlock_bh(&bond->ipsec_lock);
> >                 goto out;
> >         }
> > 
> > -       spin_lock_bh(&bond->ipsec_lock);
> >         list_for_each_entry(ipsec, &bond->ipsec_list, list) {
> > +               /* If new state is added before ipsec_lock acquired
> > */
> > +               if (ipsec->xs->xso.real_dev == real_dev)
> > +                       continue;
> > +
> >                 ipsec->xs->xso.real_dev = real_dev;
> >                 if (real_dev->xfrmdev_ops->xdo_dev_state_add(ipsec-
> > >xs, NULL)) {
> >                         slave_warn(bond_dev, real_dev, "%s: failed
> > to add SA\n", __func__);
> >                         ipsec->xs->xso.real_dev = NULL;
> >                 }
> >         }
> > -       spin_unlock_bh(&bond->ipsec_lock);
> > out:
> > -       rcu_read_unlock();
> > +       mutex_unlock(&bond->ipsec_lock);
> > }
> > 
> > /**
> > @@ -530,6 +525,8 @@ static void bond_ipsec_del_sa(struct xfrm_state
> > *xs)
> >         rcu_read_lock();
> >         bond = netdev_priv(bond_dev);
> >         slave = rcu_dereference(bond->curr_active_slave);
> > +       real_dev = slave ? slave->dev : NULL;
> > +       rcu_read_unlock();
> 
>         Is it really safe to access real_dev once we've left the rcu
> critical section?  What prevents the device referenced by real_dev
> from
> being deleted as soon as rcu_read_unlock() completes?
> 

I am not sure. But RCU protects accessing of the context pointed by
curr_active_slave, not slave->dev itself. I wrong about this?
I can move rcu_read_unlock after xdo_dev_state_delete(). And do the
same change for bond_ipsec_add_sa and bond_ipsec_free_sa.
What do you think?

Thanks!
Jianbo 

>         -J
>         
> > 
> >         if (!slave)
> >                 goto out;
> > @@ -537,7 +534,6 @@ static void bond_ipsec_del_sa(struct xfrm_state
> > *xs)
> >         if (!xs->xso.real_dev)
> >                 goto out;
> > 
> > -       real_dev = slave->dev;
> >         WARN_ON(xs->xso.real_dev != real_dev);
> > 
> >         if (!real_dev->xfrmdev_ops ||
> > @@ -549,7 +545,7 @@ static void bond_ipsec_del_sa(struct xfrm_state
> > *xs)
> > 
> >         real_dev->xfrmdev_ops->xdo_dev_state_delete(xs);
> > out:
> > -       spin_lock_bh(&bond->ipsec_lock);
> > +       mutex_lock(&bond->ipsec_lock);
> >         list_for_each_entry(ipsec, &bond->ipsec_list, list) {
> >                 if (ipsec->xs == xs) {
> >                         list_del(&ipsec->list);
> > @@ -557,8 +553,7 @@ static void bond_ipsec_del_sa(struct xfrm_state
> > *xs)
> >                         break;
> >                 }
> >         }
> > -       spin_unlock_bh(&bond->ipsec_lock);
> > -       rcu_read_unlock();
> > +       mutex_unlock(&bond->ipsec_lock);
> > }
> > 
> > static void bond_ipsec_del_sa_all(struct bonding *bond)
> > @@ -568,15 +563,12 @@ static void bond_ipsec_del_sa_all(struct
> > bonding *bond)
> >         struct bond_ipsec *ipsec;
> >         struct slave *slave;
> > 
> > -       rcu_read_lock();
> > -       slave = rcu_dereference(bond->curr_active_slave);
> > -       if (!slave) {
> > -               rcu_read_unlock();
> > +       slave = rtnl_dereference(bond->curr_active_slave);
> > +       real_dev = slave ? slave->dev : NULL;
> > +       if (!real_dev)
> >                 return;
> > -       }
> > 
> > -       real_dev = slave->dev;
> > -       spin_lock_bh(&bond->ipsec_lock);
> > +       mutex_lock(&bond->ipsec_lock);
> >         list_for_each_entry(ipsec, &bond->ipsec_list, list) {
> >                 if (!ipsec->xs->xso.real_dev)
> >                         continue;
> > @@ -593,8 +585,7 @@ static void bond_ipsec_del_sa_all(struct
> > bonding *bond)
> >                                 real_dev->xfrmdev_ops-
> > >xdo_dev_state_free(ipsec->xs);
> >                 }
> >         }
> > -       spin_unlock_bh(&bond->ipsec_lock);
> > -       rcu_read_unlock();
> > +       mutex_unlock(&bond->ipsec_lock);
> > }
> > 
> > static void bond_ipsec_free_sa(struct xfrm_state *xs)
> > @@ -5917,7 +5908,7 @@ void bond_setup(struct net_device *bond_dev)
> >         /* set up xfrm device ops (only supported in active-backup
> > right now) */
> >         bond_dev->xfrmdev_ops = &bond_xfrmdev_ops;
> >         INIT_LIST_HEAD(&bond->ipsec_list);
> > -       spin_lock_init(&bond->ipsec_lock);
> > +       mutex_init(&bond->ipsec_lock);
> > #endif /* CONFIG_XFRM_OFFLOAD */
> > 
> >         /* don't acquire bond device's netif_tx_lock when
> > transmitting */
> > @@ -5966,6 +5957,10 @@ static void bond_uninit(struct net_device
> > *bond_dev)
> >                 __bond_release_one(bond_dev, slave->dev, true,
> > true);
> >         netdev_info(bond_dev, "Released all slaves\n");
> > 
> > +#ifdef CONFIG_XFRM_OFFLOAD
> > +       mutex_destroy(&bond->ipsec_lock);
> > +#endif /* CONFIG_XFRM_OFFLOAD */
> > +
> >         bond_set_slave_arr(bond, NULL, NULL);
> > 
> >         list_del_rcu(&bond->bond_list);
> > diff --git a/include/net/bonding.h b/include/net/bonding.h
> > index b61fb1aa3a56..8bb5f016969f 100644
> > --- a/include/net/bonding.h
> > +++ b/include/net/bonding.h
> > @@ -260,7 +260,7 @@ struct bonding {
> > #ifdef CONFIG_XFRM_OFFLOAD
> >         struct list_head ipsec_list;
> >         /* protecting ipsec_list */
> > -       spinlock_t ipsec_lock;
> > +       struct mutex ipsec_lock;
> > #endif /* CONFIG_XFRM_OFFLOAD */
> >         struct bpf_prog *xdp_prog;
> > };
> > -- 
> > 2.21.0
> > 
> 
> ---
>         -Jay Vosburgh, jv@...sburgh.net

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ