linux-kernel - Re: [PATCH v3] perf/arm-dmc620: Fix dmc620_pmu_irqs_lock/cpu_hotplug

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20230804165943.GG30679@willie-the-truck>
Date:   Fri, 4 Aug 2023 17:59:44 +0100
From:   Will Deacon <will@...nel.org>
To:     Waiman Long <longman@...hat.com>
Cc:     Mark Rutland <mark.rutland@....com>,
        linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
        Robin Murphy <robin.murphy@....com>
Subject: Re: [PATCH v3] perf/arm-dmc620: Fix
 dmc620_pmu_irqs_lock/cpu_hotplug_lock circular lock dependency

On Fri, Aug 04, 2023 at 12:51:47PM -0400, Waiman Long wrote:
> On 8/4/23 12:28, Will Deacon wrote:
> > > > >    struct dmc620_pmu {
> > > > > @@ -423,9 +424,14 @@ static struct dmc620_pmu_irq *__dmc620_pmu_get_irq(int irq_num)
> > > > >    	struct dmc620_pmu_irq *irq;
> > > > >    	int ret;
> > > > > -	list_for_each_entry(irq, &dmc620_pmu_irqs, irqs_node)
> > > > > -		if (irq->irq_num == irq_num && refcount_inc_not_zero(&irq->refcount))
> > > > > +	list_for_each_entry(irq, &dmc620_pmu_irqs, irqs_node) {
> > > > > +		if (irq->irq_num != irq_num)
> > > > > +			continue;
> > > > > +		if (!irq->valid)
> > > > > +			return ERR_PTR(-EAGAIN);	/* Try again later */
> > > > It looks like this can bubble up to the probe() routine. Does the driver
> > > > core handle -EAGAIN coming back from a probe routine?
> > > Right, I should add code to handle this error condition. I think it can be
> > > handled in dmc620_pmu_get_irq(). The important thing is to release the
> > > mutex, wait a few ms and try again. What do you think?
> > I don't really follow, but waiting a few ms and trying again sounds like
> > a really nasty hack for something which doesn't appear to be constrained
> > by broken hardware. In other words, we got ourselves into this mess, so
> > we should be able to resolve it properly.
> 
> From my point of view, the proper way to solve the problem is to reverse the
> locking order. Since you don't to add a EXPORT statement to the core kernel
> code, we will have to find a way around it by not holding the
> dmc620_pmu_irqs_lock when cpuhp_state_add_instance_nocalls() is called.
> Another alternative that I can think of is to add one more mutex that we
> will hold just for the entirety of  __dmc620_pmu_get_irq() and take
> dmc620_pmu_irqs_lock only when the linked list is being modified. That will
> eliminate the need to introduce arbitrary wait as other caller of
> __dmc620_pmu_get_irq() will wait in the new mutex. Will this work for you?

Yes. To be honest, I think we've both spent far too much time trying to
fix this (and I admire your persistence!), so adding a mutex to make it
"obviously" correct sounds like the right thing to me. We can look at
optimisations later if anybody cares.

Cheers,

Will