[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b4cd8daf-ef37-4cc1-546e-ba46cb19392a@codeaurora.org>
Date: Tue, 11 Aug 2020 17:24:08 +0530
From: Akash Asthana <akashast@...eaurora.org>
To: Doug Anderson <dianders@...omium.org>
Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
kgdb-bugreport@...ts.sourceforge.net,
Mukesh Savaliya <msavaliy@...eaurora.org>,
Andy Gross <agross@...nel.org>,
Bjorn Andersson <bjorn.andersson@...aro.org>,
Evan Green <evgreen@...omium.org>,
Jiri Slaby <jirislaby@...nel.org>,
linux-arm-msm <linux-arm-msm@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>, linux-serial@...r.kernel.org
Subject: Re: [PATCH] serial: qcom_geni_serial: Fix recent kdb hang
On 8/11/2020 2:56 AM, Doug Anderson wrote:
> Hi,
>
> On Mon, Aug 10, 2020 at 5:32 AM Akash Asthana <akashast@...eaurora.org> wrote:
>> Hi Doug,
>>
>> On 8/7/2020 10:49 AM, Douglas Anderson wrote:
>>> The commit e42d6c3ec0c7 ("serial: qcom_geni_serial: Make kgdb work
>>> even if UART isn't console") worked pretty well and I've been doing a
>>> lot of debugging with it. However, recently I typed "dmesg" in kdb
>>> and then held the space key down to scroll through the pagination. My
>>> device hung. This was repeatable and I found that it was introduced
>>> with the aforementioned commit.
>>>
>>> It turns out that there are some strange boundary cases in geni where
>>> in some weird situations it will signal RX_LAST but then will put 0 in
>>> RX_LAST_BYTE. This means that the entire last FIFO entry is valid.
>> IMO that means we received a word in RX_FIFO and it is the last word
>> hence RX_LAST bit is set.
> What you say would make logical sense, but it's not how I have
> observed geni to work. See below.
>
>
>> RX_LAST_BYTE is 0 means none of the bytes are valid in the last word.
> This would imply that qcom_geni_serial_handle_rx() is also broken
> though, wouldn't it? Specifically imagine that WORD_CNT is 1 and
> RX_LAST is set and RX_LAST_BYTE_VALID is true. Here's the logic from
> that function:
>
> total_bytes = BYTES_PER_FIFO_WORD * (word_cnt - 1);
> if (last_word_partial && last_word_byte_cnt)
> total_bytes += last_word_byte_cnt;
> else
> total_bytes += BYTES_PER_FIFO_WORD;
> port->handle_rx(uport, total_bytes, drop);
>
> As you can see that logic will set "total_bytes" to 4 in the case I'm
> talking about.
Yeah IMO as per theory this should also be corrected but since you have
already pulled out few experiment to prove garbage data issue(which I
was suspecting) is not seen.
It's already consistent with existing logic and it behaves well
practically . So the changes could be merge. Meanwhile I am checking
with HW team to get clarity.
>
>
>> In such scenario we should just read RX_FIFO buffer (to empty it),
>> discard the word and return NO_POLL_CHAR. Something like below.
>>
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> else
>> private_data->poll_cached_bytes_cnt = 4;
>>
>> private_data->poll_cached_bytes =
>> readl(uport->membase + SE_GENI_RX_FIFOn);
>> }
>>
>> + if (!private_data->poll_cached_bytes_cnt)
>> + return NO_POLL_CHAR;
>> private_data->poll_cached_bytes_cnt--;
>> ret = private_data->poll_cached_bytes & 0xff;
>> -------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> Please let me know whether above code helps.
> Your code will avoid the hang. Yes. ...but it will drop bytes. I
> devised a quick-n-dirty test. Here's a test of your code:
I assumed those as invalid bytes and don't wanted to read them so yeah
dropping of bytes was expected.
>
> https://crrev.com/c/2346886
>
> ...and here's a test of my code:
>
> https://crrev.com/c/2346884
>
> I had to keep a buffer around since it's hard to debug the serial
> driver. In both cases I put "DOUG" into the buffer when I detect this
> case. If my theory about how geni worked was wrong then we should
> expect to see some garbage in the buffer right after the DOUG, right?
> ...but my code gets the alphabet in nice sequence. Your code drops 4
> bytes.
Yeah I was expecting garbage data.
>
>
> NOTE: while poking around with the above two test patches I found it
> was pretty easy to get geni to drop bytes / hit overflow cases and
> also to insert bogus 0 bytes in the stream (I believe these are
> related). I was able to reproduce this:
> * With ${SUBJECT} patch in place.
> * With your proposed patch.
> * With the recent "geni" patches reverted (in other words back to 1
> byte per FIFO entry).
>
> It's not terribly surprising that we're overflowing since I believe
> kgdb isn't too keen to read characters at the same time it's writing.
> That doesn't explain the weird 0-bytes that geni seemed to be
> inserting, but at least it would explain the overflows. However, even
> after I fixed this I _still_ was getting problems. Specifically geni
> seemed to be hiding bytes from me until it was too late. I put
> logging in and would see this:
>
> 1 word in FIFO - wxyz
> 1 word in FIFO (last set, last FIFO has 1 byte) - \n
> Check again, still 0 bytes in FIFO
> Suddenly 16 bytes are in FIFO and S_RX_FIFO_WR_ERR_EN is set.
RX data first stored in RX_ASYNC_FIFO then it's transfered to RX_FIFO
When get_char is called and we observe 0 bytes in RX_FIFO, most probably
data is not transfered from RX_ASYNC_FIFO to RX_FIFO.
BITS 27:25 of SE_GENI_RX_FIFO_STATUS register shows RX_ASYNC_FIFO word
count.
>
> I spent a whole bunch of time poking at this and couldn't find any
> sort of workaround. Presumably geni is taking some time between me
> reading the last word out of the FIFO from the "previous" packet and
> then transitioning to the new packet. I found a lot of references to
> this process in the hardware register description (see GENI_CFG_REG69,
> for instance), but I couldn't manage to make the kick to happen any
> faster. Presumably this isn't a problem for things like Bluetooth
> since flow control saves them. ...and I guess this isn't a problem in
> practice because we usually _send_ a lot of data to the host for
> console/kgdb and it's only the host => DUT path that has problems.
>
>
>> I am not sure about what all scenario can leads to this behavior from
>> hardware, I will try to get an answer from hardware team.
>>
>> Any error bit was set for SE_GENI_S_IRQ_STATUS & SE_GENI_M_IRQ_STATUS
>> registers?
> As per above I can see overflows in my test case and geni seems to be
> behaving pretty badly. If you have ideas on how to fix this I'd love
> it. However, it still seems like my patch is right because (at least
> in the cases I tested) it avoids dropping bytes in some cases. It
> also matches how qcom_geni_serial_handle_rx() works and if that was
> broken we'd have noticed by now.
Reviewed-by: Akash Asthana <akashast@...eaurora.org>
>
>> I guess the hang was seen because *poll_cached_bytes_cnt* is unsigned
>> int and it's value was 0, when it's decremented by 1 it's value become
>> '4294967295' (very large) and dummy RX (0x00) would happen that
>>
>> many times before reading any actual RX transfers/bytes.
> Right. That would be why it was hanging.
>
>
> -Doug
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,\na Linux Foundation Collaborative Project
Powered by blists - more mailing lists