lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Tue, 11 Aug 2020 17:24:08 +0530
From:   Akash Asthana <akashast@...eaurora.org>
To:     Doug Anderson <dianders@...omium.org>
Cc:     Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        kgdb-bugreport@...ts.sourceforge.net,
        Mukesh Savaliya <msavaliy@...eaurora.org>,
        Andy Gross <agross@...nel.org>,
        Bjorn Andersson <bjorn.andersson@...aro.org>,
        Evan Green <evgreen@...omium.org>,
        Jiri Slaby <jirislaby@...nel.org>,
        linux-arm-msm <linux-arm-msm@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>, linux-serial@...r.kernel.org
Subject: Re: [PATCH] serial: qcom_geni_serial: Fix recent kdb hang


On 8/11/2020 2:56 AM, Doug Anderson wrote:
> Hi,
>
> On Mon, Aug 10, 2020 at 5:32 AM Akash Asthana <akashast@...eaurora.org> wrote:
>> Hi Doug,
>>
>> On 8/7/2020 10:49 AM, Douglas Anderson wrote:
>>> The commit e42d6c3ec0c7 ("serial: qcom_geni_serial: Make kgdb work
>>> even if UART isn't console") worked pretty well and I've been doing a
>>> lot of debugging with it.  However, recently I typed "dmesg" in kdb
>>> and then held the space key down to scroll through the pagination.  My
>>> device hung.  This was repeatable and I found that it was introduced
>>> with the aforementioned commit.
>>>
>>> It turns out that there are some strange boundary cases in geni where
>>> in some weird situations it will signal RX_LAST but then will put 0 in
>>> RX_LAST_BYTE.  This means that the entire last FIFO entry is valid.
>> IMO that means we received a word in RX_FIFO and it is the last word
>> hence RX_LAST bit is set.
> What you say would make logical sense, but it's not how I have
> observed geni to work.  See below.
>
>
>> RX_LAST_BYTE is 0 means none of the bytes are valid in the last word.
> This would imply that qcom_geni_serial_handle_rx() is also broken
> though, wouldn't it?  Specifically imagine that WORD_CNT is 1 and
> RX_LAST is set and RX_LAST_BYTE_VALID is true.  Here's the logic from
> that function:
>
>    total_bytes = BYTES_PER_FIFO_WORD * (word_cnt - 1);
>    if (last_word_partial && last_word_byte_cnt)
>      total_bytes += last_word_byte_cnt;
>    else
>      total_bytes += BYTES_PER_FIFO_WORD;
>    port->handle_rx(uport, total_bytes, drop);
>
> As you can see that logic will set "total_bytes" to 4 in the case I'm
> talking about.

Yeah IMO as per theory this should also be corrected but since you have 
already pulled out few experiment to prove garbage data issue(which I  
was suspecting) is not seen.

It's already consistent with existing logic and it behaves well 
practically . So the changes could be merge. Meanwhile I am checking 
with HW team to get clarity.

>
>
>> In such scenario we should just read RX_FIFO buffer (to empty it),
>> discard the word and return NO_POLL_CHAR. Something like below.
>>
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>>                   else
>>                           private_data->poll_cached_bytes_cnt = 4;
>>
>>                   private_data->poll_cached_bytes =
>>                           readl(uport->membase + SE_GENI_RX_FIFOn);
>>           }
>>
>> +        if (!private_data->poll_cached_bytes_cnt)
>> +              return NO_POLL_CHAR;
>>           private_data->poll_cached_bytes_cnt--;
>>           ret = private_data->poll_cached_bytes & 0xff;
>> -------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> Please let me know whether above code helps.
> Your code will avoid the hang.  Yes.  ...but it will drop bytes.  I
> devised a quick-n-dirty test.  Here's a test of your code:
I assumed those as invalid bytes and don't wanted to read them so yeah 
dropping of bytes was expected.
>
> https://crrev.com/c/2346886
>
> ...and here's a test of my code:
>
> https://crrev.com/c/2346884
>
> I had to keep a buffer around since it's hard to debug the serial
> driver.  In both cases I put "DOUG" into the buffer when I detect this
> case.  If my theory about how geni worked was wrong then we should
> expect to see some garbage in the buffer right after the DOUG, right?
> ...but my code gets the alphabet in nice sequence.  Your code drops 4
> bytes.
Yeah I was expecting garbage data.
>
>
> NOTE: while poking around with the above two test patches I found it
> was pretty easy to get geni to drop bytes / hit overflow cases and
> also to insert bogus 0 bytes in the stream (I believe these are
> related).  I was able to reproduce this:
> * With ${SUBJECT} patch in place.
> * With your proposed patch.
> * With the recent "geni" patches reverted (in other words back to 1
> byte per FIFO entry).
>
> It's not terribly surprising that we're overflowing since I believe
> kgdb isn't too keen to read characters at the same time it's writing.
> That doesn't explain the weird 0-bytes that geni seemed to be
> inserting, but at least it would explain the overflows.  However, even
> after I fixed this I _still_ was getting problems.  Specifically geni
> seemed to be hiding bytes from me until it was too late.  I put
> logging in and would see this:
>
> 1 word in FIFO - wxyz
> 1 word in FIFO (last set, last FIFO has 1 byte) - \n
> Check again, still 0 bytes in FIFO
> Suddenly 16 bytes are in FIFO and S_RX_FIFO_WR_ERR_EN is set.

RX data first stored in RX_ASYNC_FIFO then it's transfered to RX_FIFO

When get_char is called and we observe 0 bytes in RX_FIFO, most probably 
data is not transfered from RX_ASYNC_FIFO to RX_FIFO.

BITS 27:25 of SE_GENI_RX_FIFO_STATUS register shows RX_ASYNC_FIFO word 
count.


>
> I spent a whole bunch of time poking at this and couldn't find any
> sort of workaround.  Presumably geni is taking some time between me
> reading the last word out of the FIFO from the "previous" packet and
> then transitioning to the new packet.  I found a lot of references to
> this process in the hardware register description (see GENI_CFG_REG69,
> for instance), but I couldn't manage to make the kick to happen any
> faster.  Presumably this isn't a problem for things like Bluetooth
> since flow control saves them.  ...and I guess this isn't a problem in
> practice because we usually _send_ a lot of data to the host for
> console/kgdb and it's only the host => DUT path that has problems.
>
>
>> I am not sure about what all scenario can leads to this behavior from
>> hardware, I will try to get an answer from hardware team.
>>
>> Any error bit was set for SE_GENI_S_IRQ_STATUS & SE_GENI_M_IRQ_STATUS
>> registers?
> As per above I can see overflows in my test case and geni seems to be
> behaving pretty badly.  If you have ideas on how to fix this I'd love
> it.  However, it still seems like my patch is right because (at least
> in the cases I tested) it avoids dropping bytes in some cases.  It
> also matches how qcom_geni_serial_handle_rx() works and if that was
> broken we'd have noticed by now.

Reviewed-by: Akash Asthana <akashast@...eaurora.org>


>
>> I guess the hang was seen because *poll_cached_bytes_cnt* is unsigned
>> int and it's value was 0, when it's decremented by 1 it's value become
>> '4294967295' (very large) and dummy RX (0x00) would happen that
>>
>> many times before reading any actual RX transfers/bytes.
> Right.  That would be why it was hanging.
>
>
> -Doug

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,\na Linux Foundation Collaborative Project

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ