linux-kernel - Re: 32-bit Amlogic (ARM) SoC: kernel BUG in kfree()

Message-ID: <32799846-b8f0-758f-32eb-a9ce435e0b79@amlogic.com>
Date:   Wed, 27 Mar 2019 16:53:22 +0800
From:   Liang Yang <liang.yang@...ogic.com>
To:     Martin Blumenstingl <martin.blumenstingl@...glemail.com>
CC:     Matthew Wilcox <willy@...radead.org>, <mhocko@...e.com>,
        <linux@...linux.org.uk>, <linux-kernel@...r.kernel.org>,
        <rppt@...ux.ibm.com>, <linux-mm@...ck.org>,
        <linux-mtd@...ts.infradead.org>,
        <linux-amlogic@...ts.infradead.org>, <akpm@...ux-foundation.org>,
        <linux-arm-kernel@...ts.infradead.org>
Subject: Re: 32-bit Amlogic (ARM) SoC: kernel BUG in kfree()

Hi Martin,

Thanks a lot.
On 2019/3/26 2:31, Martin Blumenstingl wrote:
> Hi Liang,
> 
> On Mon, Mar 25, 2019 at 11:03 AM Liang Yang <liang.yang@...ogic.com> wrote:
>>
>> Hi Martin,
>>
>> On 2019/3/23 5:07, Martin Blumenstingl wrote:
>>> Hi Matthew,
>>>
>>> On Thu, Mar 21, 2019 at 10:44 PM Matthew Wilcox <willy@...radead.org> wrote:
>>>>
>>>> On Thu, Mar 21, 2019 at 09:17:34PM +0100, Martin Blumenstingl wrote:
>>>>> Hello,
>>>>>
>>>>> I am experiencing the following crash:
>>>>>     ------------[ cut here ]------------
>>>>>     kernel BUG at mm/slub.c:3950!
>>>>
>>>>           if (unlikely(!PageSlab(page))) {
>>>>                   BUG_ON(!PageCompound(page));
>>>>
>>>> You called kfree() on the address of a page which wasn't allocated by slab.
>>>>
>>>>> I have traced this crash to the kfree() in meson_nfc_read_buf().
>>>>> my observation is as follows:
>>>>> - meson_nfc_read_buf() is called 7 times without any crash, the
>>>>> kzalloc() call returns 0xe9e6c600 (virtual address) / 0x29e6c600
>>>>> (physical address)
>>>>> - the eighth time meson_nfc_read_buf() is called kzalloc() call returns
>>>>> 0xee39a38b (virtual address) / 0x2e39a38b (physical address) and the
>>>>> final kfree() crashes
>>>>> - changing the size in the kzalloc() call from PER_INFO_BYTE (= 8) to
>>>>> PAGE_SIZE works around that crash
>>>>
>>>> I suspect you're doing something which corrupts memory.  Overrunning
>>>> the end of your allocation or something similar.  Have you tried KASAN
>>>> or even the various slab debugging (eg redzones)?
>>> KASAN is not available on 32-bit ARM. there was some progress last
>>> year [0] but it didn't make it into mainline. I tried to make the
>>> patches apply again and got it to compile (and my kernel is still
>>> booting) but I have no idea if it's still working. for anyone
>>> interested, my patches are here: [1] (I consider this a HACK because I
>>> don't know anything about the code which is being touched in the
>>> patches, I only made it compile)
>>>
>>> SLAB debugging (redzones) were a great hint, thank you very much for
>>> that Matthew! I enabled:
>>>     CONFIG_SLUB_DEBUG=y
>>>     CONFIG_SLUB_DEBUG_ON=y
>>> and with that I now get "BUG kmalloc-64 (Not tainted): Redzone
>>> overwritten" (a larger kernel log extract is attached).
>>>
>>> I'm starting to wonder if the NAND controller (hardware) writes more
>>> than 8 bytes.
>>> some context: the "info" buffer allocated in meson_nfc_read_buf is
>>> then passed to the NAND controller IP (after using dma_map_single).
>>>
>>> Liang, how does the NAND controller know that it only has to send
>>> PER_INFO_BYTE (= 8) bytes when called from meson_nfc_read_buf? all
>>> other callers of meson_nfc_dma_buffer_setup (which passes the info
>>> buffer to the hardware) are using (nand->ecc.steps * PER_INFO_BYTE)
>>> bytes?
>>>
>> NFC_CMD_N2M and CMDRWGEN are different commands. CMDRWGEN needs to set
>> the ecc page size (1KB or 512B) and Pages(2, 4, 8, ...), so
>> PER_INFO_BYTE(= 8) bytes for each ecc page.
>> I have never used NFC_CMD_N2M to transfer data before, because it is
>> very low efficient. And I do a experiment with the attachment and find
>> on overwritten on my meson axg platform.
>>
>> Martin, I would appreciate it very much if you would try the attachment
>> on your meson m8b platform.
> thank you for your debug patch! on my board 2 * PER_INFO_BYTE is not enough.
> I took the idea from your patch and adapted it so I could print a
> buffer with 256 bytes (which seems to be "big enough" for my board).
It only needs PER_INFO_BYTE (= 8) bytes, because NFC_CMD_N2M doesn't set
*Pages*. That is unlike CMDRWGEN, which needs Pages * PER_INFO_BYTE (= 8)
bytes when the *Pages* parameter is set. I have been assuming that
NFC_CMD_N2M only occupies PER_INFO_BYTE (= 8) bytes. I have also tried
not setting the info address, but then the machine crashes.
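
To make that expectation concrete, here is a small userspace sketch of the
info buffer length I would expect for each command. The enum and the helper
name are hypothetical and only for illustration; they are not the driver's
types, and this states the expectation, not what the hardware was observed
to do.

```c
#include <assert.h>
#include <stddef.h>

#define PER_INFO_BYTE 8

/* Hypothetical command kinds, for illustration only (not the driver's types). */
enum xfer_kind { XFER_N2M, XFER_RWGEN };

/*
 * Expected info buffer length in bytes:
 *  - NFC_CMD_N2M does not set Pages, so a single info entry is expected.
 *  - CMDRWGEN sets Pages, so one info entry per ECC page is expected.
 */
static size_t expected_info_len(enum xfer_kind kind, unsigned int pages)
{
    if (kind == XFER_N2M)
        return PER_INFO_BYTE;

    assert(pages > 0); /* CMDRWGEN always covers at least one ECC page */
    return (size_t)pages * PER_INFO_BYTE;
}
```

For example, with the ecc_steps:16 reported for the chip quoted below, the
CMDRWGEN case would give 16 * 8 = 128 bytes, while the N2M case stays at 8.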
> see the attached, modified patch
> 
> in the output I see that sometimes the first 32 bytes are not touched
> by the controller, but everything beyond 32 bytes is modified in the
> info buffer.
> 
It does look like the controller sometimes fills the space beyond the
first 8 bytes. However, I would expect the controller to use only the
first 8 bytes with NFC_CMD_N2M.
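
The idea behind this kind of debugging can be sketched in plain userspace C:
fill an oversized buffer with a known pattern, let the device write into it,
then scan for the range of bytes that changed. This is only a sketch under
assumptions, not the attached patch. POISON, the struct, and the helper
names are made up for illustration, and fake_controller_write() is a
stand-in for the real DMA transfer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POISON 0x5a /* arbitrary pattern; a device byte equal to it goes unnoticed */

/* Offsets of the first and last byte that no longer hold POISON. */
struct dirty_range {
    int touched;  /* 0 if every byte still holds POISON */
    size_t first;
    size_t last;
};

/* Fill the whole buffer with the pattern before handing it to the device. */
static void poison_buf(unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    memset(buf, POISON, len);
}

/* Scan the buffer after the transfer and report which bytes were modified. */
static struct dirty_range scan_buf(const unsigned char *buf, size_t len)
{
    struct dirty_range r = { 0, 0, 0 };
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < len; i++) {
        if (buf[i] == POISON)
            continue;
        if (!r.touched) {
            r.touched = 1;
            r.first = i;
        }
        r.last = i;
    }
    return r;
}

/* Hypothetical stand-in for the controller: zeroes len bytes starting at off. */
static void fake_controller_write(unsigned char *buf, size_t off, size_t len)
{
    memset(buf + off, 0x00, len);
}
```

If the scan reports anything past offset 7 after an NFC_CMD_N2M transfer,
the controller wrote more than PER_INFO_BYTE bytes into the info buffer.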
> I also tried to increase the buffer size to 512, but that didn't make
> a difference (I never saw any info buffer modification beyond 256
> bytes).
> 
> also I just noticed that I didn't give you much details on my NAND chip yet.
> from Amlogic vendor u-boot on Meson8m2 (all my Meson8b boards have
> eMMC flash, but I believe the NAND controller on Meson8 to GXBB is
> identical):
>    m8m2_n200_v1#amlnf chipinfo
>    flash  info
>    name:B revision 20nm NAND 8GiB H27UCG8T2B, id:ad de 94 eb 74 44  0  0
>    pagesize:0x4000, blocksize:0x400000, oobsize:0x500, chipsize:0x2000,
>      option:0x8, T_REA:16, T_RHOH:15
>    hw controller info
>    chip_num:1, onfi_mode:0, page_shift:14, block_shift:22, option:0xc2
>    ecc_unit:1024, ecc_bytes:70, ecc_steps:16, ecc_max:40
>    bch_mode:5, user_mode:2, oobavail:32, oobtail:64384
> 
I don't think it is caused by a different NAND type, but I have run
the same test on my GXL platform. We can see the result in the
attachment. By the way, I can't find any information about this in the
Meson NFC datasheet, so I will ask our VLSI team.
Martin, could you reproduce it with the new patch on the Meson8b platform?
I need a clearer log that is easier to compare, like gxl.txt. Thanks.

> 
> Regards
> 
> Martin
> 

View attachment "nand_debug.diff" of type "text/plain" (3399 bytes)

View attachment "gxl.txt" of type "text/plain" (26417 bytes)