[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aab71610e11c2dd293159576cc53e277@codeaurora.org>
Date: Mon, 01 Nov 2021 09:35:42 -0700
From: rishabhb@...eaurora.org
To: Cristian Marussi <cristian.marussi@....com>
Cc: sudeep.holla@....com, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org, avajid@...eaurora.org,
adharmap@...eaurora.org
Subject: Re: [PATCH v3] firmware: arm_scmi: Free mailbox channels if probe
fails
On 2021-09-01 02:35, Cristian Marussi wrote:
> On Tue, Aug 31, 2021 at 06:48:35AM +0100, Cristian Marussi wrote:
>> On Mon, Aug 30, 2021 at 02:09:37PM -0700, rishabhb@...eaurora.org
>> wrote:
>> > Hi Christian
>>
>> Hi Rishabh,
>>
>> thanks for looking into this kind of bad interactions.
>>
>> > There seems to be another issue here. The response from agent can be delayed
>> > causing a timeout during base protocol acquire,
>> > which leads to the probe failure. What I have observed is sometimes the
>> > failure of probe and rx_callback (due to a delayed message)
>> > happens at the same time on different cpus.
>> > Because of this race, the device memory may be cleared while the
>> > interrupt(rx_callback) is executing on another cpu.
>>
>> You are right that concurrency was not handled properly in this kind
>> of
>> context and moreover, if you think about it, even the case of out of
>> order reception of responses and delayed_responses (type2 SCMI
>> messages)
>> for asynchronous SCMI commands was not handled properly.
>>
>> > How do you propose we solve this? Do you think it is better to take the
>> > setting up of base and other protocols out of probe and
>> > in some delayed work? That would imply the device memory is not released
>> > until remove is called. Or should we add locking to
>> > the interrupt handler(scmi_rx_callback) and the cleanup in probe to avoid
>> > the race?
>> >
>>
>> These issues were more easily exposed by SCMI Virtio transport, so in
>> the series where I introduced scmi-virtio:
>>
>> https://lore.kernel.org/linux-arm-kernel/162848483974.232214.9506203742448269364.b4-ty@arm.com/
>>
>> (which is now queued for v5.15 ... now on -next I think...finger
>> crossed)
>>
>> I took the chance to rectify a couple of other things in the SCMI core
>> in the initial commits.
>> As an example, in the above series
>>
>> [PATCH v7 05/15] firmware: arm_scmi: Handle concurrent and
>> out-of-order messages
>>
>> cares to add a refcount to xfers and some locking on xfers between TX
>> and RX path to avoid that a timed out xfer can vanish while the rx
>> path
>> is concurrently working on it (as you said); moreover I handle the
>> condition (rare if not unplausible anyway) in which a transport
>> delivers
>> out of order responses and delayed responses.
>>
>> I tested this scenarios on some fake emulated SCMI Virtio transport
>> where I could play any sort of mess and tricks to stress this limit
>> conditions, but you're more than welcome to verify if the race you are
>> seeing on Base protocol time out is solved (as I would hope :D) by
>> this
>> series of mine.
>>
>> Let me know, any feedback is welcome.
>>
>> Btw, in the series above there are also other minor changes, but there
>> is also another more radical change needed to ensure correctness and
>> protection against stale old messages which maybe could interest you
>> in general if you are looking into SCMI:
>>
>> [PATCH v7 04/15] firmware: arm_scmi: Introduce monotonically
>> increasing tokens
>>
>> Let me know if yo have other concerns.
>>
>
> Hi Rishabhb,
>
> just a quick remark, thinking again about your fail @probe scenario
> above
> I realized that while the concurrency patch I mentioned above could
> help on
> races against vanishing xfers when late timed-out responses are
> delivered,
> here we really are then also shutting down everything on failure, so
> there
> could be further issues between a very late invokation of
> scmi_rx_callback
> and the core devm_ helpers freeing the underlying xfer/cinfo/etc..
> structs
> used by scmi-rx-callback itself (maybe this was already what you meant
> and
> I didn't get it,...sorry)
>
> On the other side, I don't feel that delaying Base init to a deferred
> worker is a viable solution since we need Base protocol init to be
> initialized and we need to just give up if we cannot communicate with
> the SCMI platform fw in such early stages. (Base protocol is really the
> only mandatory proto is I remember correctly the spec)
>
> Currenly I'm off and only glancing at mails but I'll have a thought
> about
> these issues once back in a few weeks time.
>
> Thanks,
> Cristian
>
Hi Cristian
I hope you enjoyed your vacation. Did you get a chance to look at the
issue stated above
and have some idea as to how to solve this?
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@...ts.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Powered by blists - more mailing lists