[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1326bf1d-8289-2838-e2bd-48dba78b4a6c@linux.intel.com>
Date: Fri, 2 Aug 2019 11:41:05 -0500
From: Pierre-Louis Bossart <pierre-louis.bossart@...ux.intel.com>
To: Vinod Koul <vkoul@...nel.org>
Cc: alsa-devel@...a-project.org, linux-kernel@...r.kernel.org,
tiwai@...e.de, broonie@...nel.org, gregkh@...uxfoundation.org,
jank@...ence.com, srinivas.kandagatla@...aro.org,
slawomir.blauciak@...el.com, Sanyog Kale <sanyog.r.kale@...el.com>
Subject: Re: [RFC PATCH 15/40] soundwire: cadence_master: handle multiple
status reports per Slave
On 8/2/19 11:01 AM, Vinod Koul wrote:
> On 02-08-19, 10:29, Pierre-Louis Bossart wrote:
>> On 8/2/19 7:20 AM, Vinod Koul wrote:
>>> On 25-07-19, 18:40, Pierre-Louis Bossart wrote:
>
>>>> + status[i] = SDW_SLAVE_UNATTACHED;
>>>> + break;
>>>> + case 1:
>>>> + status[i] = SDW_SLAVE_ATTACHED;
>>>> + break;
>>>> + case 2:
>>>> + status[i] = SDW_SLAVE_ALERT;
>>>> + break;
>>>> + default:
>>>> + status[i] = SDW_SLAVE_RESERVED;
>>>> + break;
>>>> + }
>>>
>>> we have same logic in the code block preceding this, maybe good idea to
>>> write a helper and use for both
>>
>> Yes, I am thinking about this. There are multiple cases where we want to
>> re-check the status and clear some bits, so helpers would be good.
>>
>>>
>>> Also IIRC we can have multiple status set right?
>>
>> Yes, the status bits are sticky and mirror all values reported in PING
>> frames. I am still working on how to clear those bits, there are cases where
>> we clear bits and end-up never hearing from that device ever again. classic
>> edge/level issue I suppose.
>
> Then the case logic above doesn't work, it should be like the code block
> preceding this..
what I was referring to already is a problem even in single status mode.
Let's say for example that a device shows up as Device0, then we try to
enumerate it and program a non-zero device number. If that fails, we
currently clear the Attached status for Device0, so will never have an
interrupt ever again. The device is there, attached as Device0, but
we've lost the single opportunity to make it usable. The link is in most
cases going to be extremely reliable, but if we know of state machines
that lead to a terminal state then we should proactively have a recovery
mechanism to avoid complicated debug down the road for cases where the
hardware has transient issues.
For the multiple status case, we will have to look at the details and
figure out which of the flags get cleared and which ones don't.
One certainty is that we absolutely have to track IO errors in interrupt
context. They are recoverable in regular context but not quite in
interrupt context if we clear the status bits unconditionally.
Maybe a tangent here but to be transparent there are really multiple
topics we are tracking at the moment:
1. error handling in interrupts. I found a leak where if a device goes
in the weeds while we program its device number and resynchronizes then
we allocate a new device number instead of reusing the initial one. The
bit clearing is also to be checked as explained above.
2. module dependencies: there is a race condition leading to a kernel
oops if the Slave probe is not complete before the .update_status is
invoked.
3. jack detection. The jack detection routine is called as a result of
an imp-def Slave interrupt. We never documented the assumption that if
this jack detection takes time then it needs to be done offline, e.g. in
a work queue. Or if we still want it to be done in a the interrupt
thread then we need to re-enable interrupts earlier, otherwise one
device can stop interrupt handling for a fairly long duration.
4. streaming stop on link errors. We've seen in tests that if you reset
the link or a Slave device with debugfs while audio is playing then
streaming continues. This condition could happen if a device loses sync,
and the spec says the Slave needs to reset its channel enable bits. At
the command level, we handle this situation and will recover, but there
is no notification to the ALSA layers to try and recover on the PCM side
of things (as if it were an underflow condition). We also try to disable
a stream but get all kinds of errors since it's lost state.
All of those points are corner cases but they are important to solve for
actual products.
Powered by blists - more mailing lists