[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240305112708.56869e4c@bootlin.com>
Date: Tue, 5 Mar 2024 11:27:08 +0100
From: Herve Codina <herve.codina@...tlin.com>
To: Nuno Sá <noname.nuno@...il.com>, Saravana Kannan
<saravanak@...gle.com>, Rob Herring <robh@...nel.org>
Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>, "Rafael J. Wysocki"
<rafael@...nel.org>, Frank Rowand <frowand.list@...il.com>, Lizhi Hou
<lizhi.hou@....com>, Max Zhen <max.zhen@....com>, Sonal Santan
<sonal.santan@....com>, Stefano Stabellini <stefano.stabellini@...inx.com>,
Jonathan Cameron <Jonathan.Cameron@...wei.com>,
linux-kernel@...r.kernel.org, devicetree@...r.kernel.org, Allan Nielsen
<allan.nielsen@...rochip.com>, Horatiu Vultur
<horatiu.vultur@...rochip.com>, Steen Hegelund
<steen.hegelund@...rochip.com>, Luca Ceresoli <luca.ceresoli@...tlin.com>,
Nuno Sa <nuno.sa@...log.com>, Thomas Petazzoni
<thomas.petazzoni@...tlin.com>, stable@...r.kernel.org
Subject: Re: [PATCH v3 2/2] of: overlay: Synchronize of_overlay_remove()
with the devlink removals
Hi Nuno, Saravana, Rob,
On Tue, 05 Mar 2024 08:36:45 +0100
Nuno Sá <noname.nuno@...il.com> wrote:
> On Mon, 2024-03-04 at 22:47 -0800, Saravana Kannan wrote:
> > On Mon, Mar 4, 2024 at 8:49 AM Herve Codina <herve.codina@...tlin.com> wrote:
> > >
> > > Hi Rob,
> > >
> > > On Mon, 4 Mar 2024 09:22:02 -0600
> > > Rob Herring <robh@...nel.org> wrote:
> > >
> > > ...
> > >
> > > > > > @@ -853,6 +854,14 @@ static void free_overlay_changeset(struct
> > > > > > overlay_changeset *ovcs)
> > > > > > {
> > > > > > int i;
> > > > > >
> > > > > > + /*
> > > > > > + * Wait for any ongoing device link removals before removing some of
> > > > > > + * nodes. Drop the global lock while waiting
> > > > > > + */
> > > > > > + mutex_unlock(&of_mutex);
> > > > > > + device_link_wait_removal();
> > > > > > + mutex_lock(&of_mutex);
> > > > >
> > > > > I'm still not convinced we need to drop the lock. What happens if
> > > > > someone else
> > > > > grabs the lock while we are in device_link_wait_removal()? Can we
> > > > > guarantee that
> > > > > we can't screw things badly?
> > > >
> > > > It is also just ugly because it's the callers of
> > > > free_overlay_changeset() that hold the lock and now we're releasing it
> > > > behind their back.
> > > >
> > > > As device_link_wait_removal() is called before we touch anything, can't
> > > > it be called before we take the lock? And do we need to call it if
> > > > applying the overlay fails?
> >
> > Rob,
> >
> > This[1] scenario Luca reported seems like a reason for the
> > device_link_wait_removal() to be where Herve put it. That example
> > seems reasonable.
> >
> > [1] - https://lore.kernel.org/all/20231220181627.341e8789@booty/
> >
>
> I'm still not totally convinced about that. Why not putting the check right
> before checking the kref in __of_changeset_entry_destroy(). I'll contradict
> myself a bit because this is just theory but if we look at pci_stop_dev(), which
> AFAIU, could be reached from a sysfs write(), we have:
>
> device_release_driver(&dev->dev);
> ...
> of_pci_remove_node(dev);
> of_changeset_revert(np->data);
> of_changeset_destroy(np->data);
>
> So looking at the above we would hit the same issue if we flush the queue in
> free_overlay_changeset() - as the queue won't be flushed at all and we could
> have devlink removal due to device_release_driver(). Right?
>
> Again, completely theoretical but seems like a reasonable one plus I'm not
> understanding the push against having the flush in
> __of_changeset_entry_destroy(). Conceptually, it looks the best place to me but
> I may be missing some issue in doing it there?
Instead of having the wait called in __of_changeset_entry_destroy() and so
called in a loop. I could move this call in the __of_changeset_entry_destroy()
caller (without any of_mutex lock drop).
So this will look like this:
--- 8< ---
void of_changeset_destroy(struct of_changeset *ocs)
{
struct of_changeset_entry *ce, *cen;
device_link_wait_removal();
list_for_each_entry_safe_reverse(ce, cen, &ocs->entries, node)
__of_changeset_entry_destroy(ce);
}
--- 8< ---
I already tested on my system and it works correctly with
device_link_wait_removal() only called from of_changeset_destroy()
as proposed.
Saravana, Nuno, Rob does it seems ok for you ?
..
> >
> > In general I hate these kinds of sequences that release a lock and
> > then grab it again quickly. It's not always a bug, but my personal
> > take on that is 90% of these introduce a bug.
> >
> > Drop the unlock/lock and we'll deal a deadlock if we actually hit one.
> > I'm also fairly certain that device_link_wait_removal() can't trigger
> > something else that can cause an OF overlay change while we are in the
> > middle of one. And like Rob said, I'm not sure this unlock/lock is a
> > good solution for that anyway.
>
> Totally agree. Unless we really see a deadlock this is a very bad idea (IMHO).
> Even on the PCI code, it seems to me that we're never destroying a changeset
> from a device/kobj_type release callback. That would be super weird right?
Convinced too.
I will drop the unlock/re-lock sequence in the next iteration of this series.
Best regards,
Hervé
--
Hervé Codina, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
Powered by blists - more mailing lists