lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <174265415457.356712.10472727127735290090.stgit@pro.pro>
Date: Sat, 22 Mar 2025 17:37:41 +0300
From: Kirill Tkhai <tkhai@...ru>
To: netdev@...r.kernel.org,
	linux-kernel@...r.kernel.org
Cc: tkhai@...ru
Subject: [PATCH NET-PREV 00/51] Kill rtnl_lock using fine-grained nd_lock

Hi,

this patchset shows the way to completely remove rtnl lock and that
this process can be done iteratively without any shocks. It implements
the architecture of new fine-grained locking to use instead of rtnl,
and iteratively converts many drivers to use it.

I mostly write this mostly a few years ago, more or less recently
I rebased the patches on kernel around 6.11 (there should not
be many conflicts on that version). Currenly I have no plans
to complete this.

If anyone wants to continue, this person can take this patchset
and done the work.

Kirill Tkhai

-----------------------------------------------------------------------

The stages and what is done.

0)Introduce nd_lock. lock_netdev(), double_lock_netdev(), ordering,
  nd_lock_transfer_devices()

1)The target of this stage is to attach nd_lock to every registering
  device and to keep it locked during netdevice registration.

  The significant thing is that we want to have two or more netdevices
  sharing the same nd_lock in case of they are configured or modified
  together during their lifetime. E.g., bridge and all their port must
  share the same nd_lock. Bond and bound devices mush share the same
  nd_lock. Both peer of veth are also. Upper and lower, master and slave
  devices too.

  [This is recursive rule. E.g., if veth is a port of bridge, then
   the both veth peers, all bridge ports and bridge itself must relate
   to the same nd_lock].
  
  Net devices in kernel are registered via register_netdev() and register_netdevices()
  functions.

  a)register_netdev() can't be called nested in a configuration actions
    (since it's called with rtnl_mutex unlocked), and usually it is used
    to register a standalone device (say, driver for physical device.
    There are exceptions, and we'll talk about them). So, this primitive
    is not very interesting for us.

  b)register_netdevice() may be called nested, e.g. from netdevice notifier,
    or two devices may be registered at once (e.g., from .newlink or
    .changelink). Later these two devices usually modified at the same time
    together. Users of this primitive want modifications to make devices
    requiring to share nd_lock really relate to same nd_lock.

  To make this, we introduce a new variation __register_netdevice().
  The difference between __register_netdevice() and register_netdevice()
  is in that register_netdevice() allocates and attaches a new nd_lock
  to device, while __register_netdevice() should be used in the cases,
  when nd_lock is inherited from bound device (say, the second veth peer
  inherits nd_lock from the first peer).

  Important thing to say is that despite in general register_netdev()
  is used to register a standalone device, there are some (mentioned)
  exceptions, where several devices may used together in netdevice
  notifiers (one of examples is mlx5 drivers family). To handle such
  cases and to make their modifications not vital, we connect such
  devices to a special fallback_nd_lock. Connected to this fallback_nd_lock,
  such devices will share the same nd_lock like we want.
  Note, that register_netdev() are changed to use fallback_nd_lock
  for every registering device to minimize number of patches for now.
  For the most drivers, register_netdev() should be replaced with
  register_netdevice() under new nd_lock is locked like it's done
  in the patch for loopback. See that patch for the example.

2)The target of this stage is to modify .newlink and .changelink
  to make registering and attaching devices sharing the same nd_lock.
  
  Currently, rtnl_newlink() and rtnl_setlink() stacks look like:

  rtnl_setlink()
    dev1 = __dev_get_by_index(index1)
    dev2 = rtnl_create_link()
    ops->newlink() or ops->changelink()
      dev3 = __dev_get_by_index(index3)
      netdev_upper_dev_link(dev2, dev3)
    do_set_master()
      dev4 = __dev_get_by_index(index4)

  These dev1, dev2, dev3 and dev4 have to relate the same nd_lock.
  Transfering between two nd_locks requires to own both of them
  (see nd_lock_transfer_devices()). But it's impossible in the above
  stack since a nested taking of nd_lock requires to follow ordering
  rules.

  To make nested locking possible, we transform the stack in the below:

  rtnl_setlink()
    dev1 = __dev_get_by_index(index1)
    dev4 = __dev_get_by_index(index4)
    dev3 = __dev_get_by_index(index3)

    double_lock_netdev(dev1, &nd_lockA, dev4, &nd_lockB)
    nd_lock_transfer_devices(&nd_lockA, &nd_lockB)        // Now dev1 and dev4 relate
    double_unlock_netdev(nd_lockA, nd_lockB)              // to same nd_lock

    double_lock_netdev(dev1, &nd_lockA, dev3, &nd_lockB)
    nd_lock_transfer_devices(&nd_lockA, &nd_lockB)        // Now dev1 and dev3 relate
    double_unlock_netdev(nd_lockA, nd_lockB)              // to same nd_lock (and dev4)

    lock_netdev(dev1, &nd_lockA)
    
    dev2 = rtnl_create_link()
    attach_nd_lock(dev2, nd_lockA) // Now all four devices share the same nd lock

    ops->newlink() or ops->changelink()
      dev3 = __dev_get_by_index(index3)
      netdev_upper_dev_link(dev2, dev3)

    do_set_master(dev4)

    unlock_netdev(nd_lockA)

  It's important to see that in this example rtnl_setlink() knows that ops->newlink
  will dereference device with index3. To make this possible, struct rtnl_link_ops
  was extended by two deps members: .newlink_deps and .changelink_deps. Instead of
  describing them formally, I'll show them on two examples:

    struct link_deps bond_changelink_deps = {
      .optional.data = { IFLA_BOND_ACTIVE_SLAVE, IFLA_BOND_PRIMARY, },
    };

    static struct link_deps hsr_newlink_deps = {
      .mandatory.data = { IFLA_HSR_SLAVE1, IFLA_HSR_SLAVE2, },
      .optional.data = { IFLA_HSR_INTERLINK, },
    };

    struct link_deps generic_newlink_deps = {
      .mandatory.tb = { IFLA_LINK, }
    };


  These ids are that bond_changelink(), hsr_newlink() and most of other
  drivers dereference. They may be mandatory and optional in dependence
  of the way .newlink/.changelink react on devices with such indexes
  exist or not. The rest .data and .tb is related to the array used
  by .newlink and .changelink to get device id:

  int                     (*newlink)(struct net *src_net,
                                     struct net_device *dev,
                                     struct nlattr *tb[],     <---
                                     struct nlattr *data[],   <---
                                     struct netlink_ext_ack *extack);


  After we introduces .newlink_deps and .changelink_deps and we simply
  use corrent nd_lock to attach nested device in .newlink and .changelink.

3)The target of this stage is to make the rest of drivers using register_netdevice()
  to share the same nd_lock by bound devices. Examples: dsa, cfg80211, etc.
  Here are just non-invasive refactoring.

  Now we have all drivers changed to share nd_lock in correct way.

4)At this stage we make all netdev_master_upper_dev_link() called under nd_lock.

  Now all connecting to master is protected by nd_lock (unlink is not).

5)At this stage we make all dev_change_net_namespace() called under nd_lock.

6)Next is to make NETDEV_REGISTER event notifier be called under nd_lock.

  After previous stage it's not difficult.
  See comments about netdevice events in its chapter below.

The patchset ends at this stage.

What is next?

1)More and more netdev parameters should be placed under locked
  nd_lock like it's done in this patchset. Netdevice events, rtnl
  callbacks, ioctls, etc.

  The target is to place a taking nd_lock of device upper in stack
  and to bring everything

  from:
    ioctl()
      rtnl_lock()
      dev = __dev_get_by_index(index)
        func1()
          change dev parameter1
          func2()
            lock_netdev(dev)          <-- attention here
            netdev_master_upper_dev_link()
            change dev parameter2

  to:
    ioctl()
      rtnl_lock()
      dev = __dev_get_by_index(index)
      lock_netdev(dev)                <-- attention here
        func1()
          change dev parameter1
          func2()
            netdev_master_upper_dev_link()
            change dev parameter2

  After we complete this, all device parametes will be protected
  by nd_lock.

  Keep in mind, that despite we introduced nd_lock and begun
  to use it, there is no concurrency during access to device
  parameters yet until rtnl_mutex is taken first. Everything
  is still protected by rtnl_mutex. So, our responsibility
  is not to prevent races connected with introduction nd_lock.
  Our responcibility is tracing nested functions calls and
  to prevent taking nd_lock when we already own it.

2)The next stage is we change rtnl_mutex and nd_lock order.
  We dereference devices from RCU lists.

    ioctl(index)
      dev = netdev_get_by_index_locked(index, &nd_lock) <-- attention to _locked suffix
      rtnl_lock()                                       <-- and here
        func1()
          change dev parameter1
          func2()
            netdev_master_upper_dev_link()
            change dev parameter2
      ...
      rtnl_unlock()
      unlock_netdev(nd_lock)

  A new function netdev_get_by_index_locked() from the above:

  struct net_device *netdev_get_by_index_locked(index, nd_lock_ptr)
  {
  again:
      dev = netdev_get_by_index(index, tracker);
      if (!dev)
        return NULL;

      if (!lock_netdev(dev, nd_lock_ptr)) {   /* device was unregistered in parallel */
         netdev_put(dev, tracker);
         goto again;
      }
      netdev_put(dev, tracker);

      if (dev->ifindex != index) {
        unlock_netdev(*nd_lock_ptr);
        goto again;
      }

      return dev;
  }

  The same way will look a new function taking two locks of two devices.

3)Now rtnl_lock() will go down in stack and this way will be fast:

  ioctl(index)
      dev = netdev_get_by_index_locked(index, &nd_lock)
        func1()
          change dev parameter1
          func2()
            netdev_master_upper_dev_link()
            rtnl_lock();                    <--- now it is here
            change dev parameter2
            rtnl_unlock()
      ...
      unlock_netdev(nd_lock)

  We will move more parameters out of rtnl_lock and one day nothing
  will be protected using it. But we may leave it to protect something
  unrelated to net devices by historical reasons.

***)Netdevice events

int register_netdevice_notifier(struct notifier_block *nb)

int register_netdevice_notifier_net(struct net *net, struct notifier_block *nb)


int register_netdevice_notifier_dev_net(struct net_device *dev,
                                        struct notifier_block *nb,
                                        struct netdev_net_notifier *nn)

  This may be converted into register_nd_lock_group_notifier()
---

Kirill Tkhai (51):
      net: Move some checks from __rtnl_newlink() to caller
      net: Add nlaattr check to rtnl_link_get_net_capable()
      net: do_setlink() refactoring: move target_net acquiring to callers
      net: Extract some code from __rtnl_newlink() to separate func
      net: Move dereference of tb[IFLA_MASTER] up
      net: Use unregister_netdevice_many() for both error cases in rtnl_newlink_create()
      net: Introduce nd_lock and primitives to work with it
      net: Initially attaching and detaching nd_lock
      net: Use register_netdevice() in loopback()
      net: Underline newlink and changelink dependencies
      net: Make master and slaves (any dependent devices) share the same nd_lock in .setlink etc
      net: Use __register_netdevice in trivial .newlink cases
      infiniband_ipoib: Use __register_netdevice in .newlink
      vxcan: Use __register_netdevice in .newlink
      iavf: Use __register_netdevice()
      geneve: Use __register_netdevice in .newlink
      netkit: Use __register_netdevice in .newlink
      qmi_wwan: Use __register_netdevice in .newlink
      bpqether: Provide determined context in __register_netdevice()
      ppp: Use __register_netdevice in .newlink
      veth: Use __register_netdevice in .newlink
      vxlan: Use __register_netdevice in .newlink
      hdlc_fr: Use __register_netdevice
      lapbeth: Provide determined context in __register_netdevice()
      wwan: Use __register_netdevice in .newlink
      6lowpan: Use __register_netdevice in .newlink
      vlan: Use __register_netdevice in .newlink
      dsa: Use __register_netdevice()
      ip6gre: Use __register_netdevice() in .changelink
      ip6_tunnel: Use __register_netdevice() in .newlink and .changelink
      ip6_vti: Use __register_netdevice() in .newlink and .changelink
      ip6_sit: Use __register_netdevice() in .newlink and .changelink
      net: Now check nobody calls register_netdevice() with nd_lock attached
      dsa: Make all switch tree ports relate to same nd_lock
      cfg80211: Use fallback_nd_lock for registered devices
      ieee802154: Use fallback_nd_lock for registered devices
      net: Introduce delayed event work
      failover: Link master and slave under nd_lock
      netvsc: Make joined device to share master's nd_lock
      openvswitch: Make ports share nd_lock of master device
      bridge: Make port to have the same nd_lock as bridge
      bond: Make master and slave relate to the same nd_lock
      net: Now check nobody calls netdev_master_upper_dev_link() without nd_lock attached
      net: Call dellink with nd_lock is held
      t7xx: Use __unregister_netdevice()
      6lowpan: Use __unregister_netdevice()
      netvsc: Call dev_change_net_namespace() under nd_lock
      default_device: Call dev_change_net_namespace() under nd_lock
      ieee802154: Call dev_change_net_namespace() under nd_lock
      cfg80211: Call dev_change_net_namespace() under nd_lock
      net: Make all NETDEV_REGISTER events to be called under nd_lock


 drivers/infiniband/ulp/ipoib/ipoib_netlink.c       |    3 
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c          |   12 
 drivers/net/amt.c                                  |    9 
 drivers/net/bareudp.c                              |    2 
 drivers/net/bonding/bond_main.c                    |    4 
 drivers/net/bonding/bond_netlink.c                 |    7 
 drivers/net/bonding/bond_options.c                 |    4 
 drivers/net/can/vxcan.c                            |    8 
 drivers/net/ethernet/intel/iavf/iavf_main.c        |   59 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c |    3 
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c    |    2 
 drivers/net/geneve.c                               |   12 
 drivers/net/gtp.c                                  |    2 
 drivers/net/hamradio/bpqether.c                    |   33 +
 drivers/net/hyperv/netvsc_drv.c                    |   28 +
 drivers/net/ipvlan/ipvlan_main.c                   |    5 
 drivers/net/ipvlan/ipvtap.c                        |    1 
 drivers/net/loopback.c                             |    6 
 drivers/net/macsec.c                               |    5 
 drivers/net/macvlan.c                              |    5 
 drivers/net/macvtap.c                              |    1 
 drivers/net/netkit.c                               |    8 
 drivers/net/pfcp.c                                 |    2 
 drivers/net/ppp/ppp_generic.c                      |   13 
 drivers/net/team/team_core.c                       |    2 
 drivers/net/usb/qmi_wwan.c                         |   14 
 drivers/net/veth.c                                 |   11 
 drivers/net/vrf.c                                  |    6 
 drivers/net/vxlan/vxlan_core.c                     |   42 +
 drivers/net/wan/hdlc_fr.c                          |   18 -
 drivers/net/wan/lapbether.c                        |   28 +
 drivers/net/wireguard/device.c                     |    2 
 drivers/net/wireless/ath/ath6kl/core.c             |    2 
 drivers/net/wireless/ath/wil6210/netdev.c          |    2 
 drivers/net/wireless/marvell/mwifiex/main.c        |    5 
 drivers/net/wireless/quantenna/qtnfmac/core.c      |    2 
 drivers/net/wireless/virtual/virt_wifi.c           |    5 
 drivers/net/wwan/iosm/iosm_ipc_wwan.c              |    2 
 drivers/net/wwan/mhi_wwan_mbim.c                   |    2 
 drivers/net/wwan/t7xx/t7xx_netdev.c                |    4 
 drivers/net/wwan/wwan_core.c                       |   13 
 include/linux/netdevice.h                          |   38 +
 include/net/rtnetlink.h                            |   16 +
 net/6lowpan/core.c                                 |   16 -
 net/8021q/vlan.c                                   |   11 
 net/8021q/vlan_netlink.c                           |    1 
 net/batman-adv/soft-interface.c                    |    2 
 net/bridge/br_ioctl.c                              |    8 
 net/bridge/br_netlink.c                            |    2 
 net/caif/chnl_net.c                                |    2 
 net/core/dev.c                                     |  576 +++++++++++++++++++-
 net/core/dev_ioctl.c                               |    1 
 net/core/failover.c                                |   24 +
 net/core/rtnetlink.c                               |  478 ++++++++++++-----
 net/dsa/dsa.c                                      |   14 
 net/dsa/netlink.c                                  |    5 
 net/dsa/user.c                                     |   25 +
 net/hsr/hsr_device.c                               |    4 
 net/hsr/hsr_netlink.c                              |    6 
 net/ieee802154/6lowpan/core.c                      |    1 
 net/ieee802154/core.c                              |    2 
 net/ieee802154/nl802154.c                          |    8 
 net/ipv4/ip_tunnel.c                               |    4 
 net/ipv6/ip6_gre.c                                 |   28 +
 net/ipv6/ip6_tunnel.c                              |   37 +
 net/ipv6/ip6_vti.c                                 |   36 +
 net/ipv6/sit.c                                     |   45 +-
 net/mac80211/main.c                                |    2 
 net/mac802154/cfg.c                                |    2 
 net/mac802154/iface.c                              |   10 
 net/mac802154/main.c                               |    2 
 net/openvswitch/vport-netdev.c                     |    6 
 net/wireless/core.c                                |   12 
 net/wireless/nl80211.c                             |   15 +
 net/xfrm/xfrm_interface_core.c                     |    2 
 75 files changed, 1532 insertions(+), 303 deletions(-)

--
Signed-off-by: Kirill Tkhai <tkhai@...ru>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ