Message-Id: <20201014193615.1045792-1-michael.auchter@ni.com>
Date:   Wed, 14 Oct 2020 14:36:12 -0500
From:   Michael Auchter <michael.auchter@...com>
To:     devicetree@...r.kernel.org, linux-kernel@...r.kernel.org
Cc:     saravanak@...gle.com, robh+dt@...nel.org, frowand.list@...il.com,
        gregkh@...uxfoundation.org, rafael@...nel.org,
        Michael Auchter <michael.auchter@...com>
Subject: [RFC PATCH 0/3] Fix errors on DT overlay removal with devlinks

After updating to v5.9, I've started seeing errors in the kernel log
when using device tree overlays. Specifically, the problem seems to
happen when removing a device tree overlay that contains two devices
with some dependency between them (e.g., a device that provides a clock
and a device that consumes that clock). Removing such an overlay results
in:

  OF: ERROR: memory leak, expected refcount 1 instead of 2, of_node_get()/of_node_put() unbalanced - destroy
  OF: ERROR: memory leak, expected refcount 1 instead of 2, of_node_get()/of_node_put() unbalanced - destroy

followed by hitting some REFCOUNT_WARNs in refcount.c.

In the first patch, I've included a unittest that can be used to
reproduce this when built with CONFIG_OF_UNITTEST [1].

I believe the issue is caused by the cleanup performed when releasing
the devlink device that's created to represent the dependency between
devices. The devlink device has references to the consumer and supplier
devices, which it drops in device_link_free; the devlink device's
release callback calls device_link_free via call_srcu.

When the overlay is being removed, all devices are removed, and
eventually the release callback for the devlink device runs and
schedules cleanup using call_srcu. Before device_link_free can run and
call put_device on the consumer/supplier, the rest of the overlay
removal process runs, resulting in the error traces above.

Patches 2 and 3 are an attempt at fixing this: call srcu_barrier to wait
for any pending device_link_free's to execute before continuing on with
the removal process.
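
To illustrate the ordering problem and why a barrier helps, here is a
minimal userspace sketch. It is not kernel code and not the code from
these patches: struct node, node_put, defer_call, deferred_barrier and
remove_overlay are hypothetical stand-ins for the device node refcount,
put_device, call_srcu, srcu_barrier and the overlay removal path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of the race; none of this is kernel API. */
struct node {
	int refcount;
};

struct deferred {
	void (*fn)(struct node *);
	struct node *arg;
};

#define MAX_DEFERRED 8
static struct deferred pending[MAX_DEFERRED];
static size_t npending;

/* Stand-in for put_device(): drop one reference. */
static void node_put(struct node *n)
{
	n->refcount--;
}

/* Stand-in for call_srcu(): queue the callback instead of running it. */
static void defer_call(void (*fn)(struct node *), struct node *arg)
{
	assert(npending < MAX_DEFERRED);
	pending[npending].fn = fn;
	pending[npending].arg = arg;
	npending++;
}

/* Stand-in for srcu_barrier(): run every queued callback, then return. */
static void deferred_barrier(void)
{
	for (size_t i = 0; i < npending; i++)
		pending[i].fn(pending[i].arg);
	npending = 0;
}

/*
 * Stand-in for overlay removal. The devlink release only schedules the
 * put; the refcount check then runs either right away or after a barrier.
 * Returns 1 if the "expected refcount 1" check would fail, 0 otherwise.
 */
static int remove_overlay(struct node *n, int use_barrier)
{
	defer_call(node_put, n);
	if (use_barrier)
		deferred_barrier();
	return n->refcount != 1;
}
```

Starting from a refcount of 2 (one reference held by the devlink),
remove_overlay(&n, 0) reports the mismatch because the deferred put has
not run yet, while remove_overlay(&n, 1) does not.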

These patches resolve the issue, but probably not in the best way. In
particular, it seems strange to need to leak details of devlinks into
the device tree overlay code. So, I'd be curious to get some feedback or
hear any other ideas for how to resolve this issue.

Thanks,
 Michael

1. Note that this isn't a very good unit test: it will report a "pass"
   even if it fails with the aforementioned errors, as these errors
   aren't propagated.

Michael Auchter (3):
  of: unittest: add test of overlay with devlinks
  driver core: add device_links_barrier
  of: dynamic: add device links barrier before detach

 drivers/base/core.c                     | 10 ++++++++++
 drivers/of/dynamic.c                    |  3 +++
 drivers/of/unittest-data/Makefile       |  1 +
 drivers/of/unittest-data/overlay_16.dts | 26 +++++++++++++++++++++++++
 drivers/of/unittest.c                   | 16 +++++++++++++++
 include/linux/device.h                  |  1 +
 6 files changed, 57 insertions(+)
 create mode 100644 drivers/of/unittest-data/overlay_16.dts

-- 
2.25.4
