[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <ede31cc0-5662-4ce1-b718-82ad2d84c417@arm.com>
Date: Fri, 3 Nov 2023 16:43:58 +0000
From: Robin Murphy <robin.murphy@....com>
To: Konrad Dybcio <konradybcio@...nel.org>,
Hector Martin <marcan@...can.st>, iommu@...ts.linux.dev,
Asahi Linux <asahi@...ts.linux.dev>,
LKML <linux-kernel@...r.kernel.org>
Cc: Janne Grunau <j@...nau.net>, Rob Herring <robh+dt@...nel.org>,
Joerg Roedel <joro@...tes.org>, Will Deacon <will@...nel.org>,
Dmitry Baryshkov <dmitry.baryshkov@...aro.org>
Subject: Re: Race between of_iommu_configure() and iommu_probe_device()
On 2023-11-03 12:48 pm, Konrad Dybcio wrote:
>
>
> On 3.11.2023 12:55, Hector Martin wrote:
>> I just hit a crash in of_iommu_xlate() -> apple_dart_of_xlate() because
>> dev->iommu was NULL. of_iommu_xlate() first calls iommu_fwspec_init
>> which calls dev_iommu_get(), which allocates that member if NULL. That
>> means it got freed in between, but the only thing that can do that is
>> dev_iommu_free(), which is called from __iommu_probe_device() in the
>> error path. That is serialized via a static lock, but not against the
>> xlate stuff.
>>
>> I think the specific sequence of events was as follows:
>>
>> - IOMMU driver has not probed yet
>> - Device driver tries to probe, and gets deferred via of_iommu_xlate()
>> -> driver_deferred_probe_check_state() because there are no IOMMU ops yet
>> - IOMMU driver probes
>> - IOMMU driver registration triggers device probes
>> - IOMMU device probe fails, because there is no fwnode/OF data yet (e.g.
>> apple_dart_probe_device returns ENODEV if dev_iommu_priv_get() returns
>> NULL, and that is set in apple_dart_of_xlate())
>> - __iommu_probe_device is in the error exit path, and at this exact
>> point a parallel device probe is running of_iommu_xlate()
>> - of_iommu_xlate() calls iommu_fwspec_init(), which ensures dev->iommu
>> is non-NULL, which at this point it is
>> - immediately after that, __iommu_probe_device() calls dev_iommu_free()
>> since it is in the process of erroring out. This frees and sets
>> dev->iommu to NULL.
>> - of_iommu_xlate() calls ops->of_xlate()
>> - apple_dart_of_xlate() calls dev_iommu_priv_set(), which crashes
>> because dev->iommu is now NULL.
>>
>> As far as I can tell it's not just the specific driver xlate call
>> setting priv that's the problem here, but there is one big race between
>> the entire fwspec codepath (accessing dev->iommu->fwspec) and
>> __iommu_probe_device() (allocating and freeing dev->iommu).
>>
>> Thinking about this whole thing is making my brain hurt. Thoughts? How
>> do we fix this?
> FWIW I've been getting inexplicable boot-time crashes that sometimes
> spew out a fraction of a log line like:
>
> [x.yyyyyyyy] addr.iommu
>
> on some Qualcomm devices every now and then for quite some time..
> Not very common though. Might be this, might be something else..
Sounds likely to all be the same thing as here:
https://lore.kernel.org/linux-iommu/1698825902-10685-1-git-send-email-quic_zhenhuah@quicinc.com/
The true solution is to pull the of_xlate step into iommu_probe_device()
itself, which I'm working towards, and finally get rid of the horrible
"replay" logic which causes no end of problems.
Thanks,
Robin.
Powered by blists - more mailing lists