linux-kernel - Re: [PATCH] mm, kasan: don't poison boot memory

Message-ID: <cb8564e8-3535-826b-2d42-b273a0d793fb@oracle.com>
Date:   Mon, 22 Feb 2021 13:42:56 -0500
From:   George Kennedy <george.kennedy@...cle.com>
To:     David Hildenbrand <david@...hat.com>,
        Andrey Konovalov <andreyknvl@...gle.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Catalin Marinas <catalin.marinas@....com>,
        Vincenzo Frascino <vincenzo.frascino@....com>,
        Dmitry Vyukov <dvyukov@...gle.com>,
        Konrad Rzeszutek Wilk <konrad@...nok.org>,
        Will Deacon <will.deacon@....com>,
        Andrey Ryabinin <aryabinin@...tuozzo.com>,
        Alexander Potapenko <glider@...gle.com>,
        Marco Elver <elver@...gle.com>,
        Peter Collingbourne <pcc@...gle.com>,
        Evgenii Stepanov <eugenis@...gle.com>,
        Branislav Rankov <Branislav.Rankov@....com>,
        Kevin Brodsky <kevin.brodsky@....com>,
        Christoph Hellwig <hch@...radead.org>,
        kasan-dev <kasan-dev@...glegroups.com>,
        Linux ARM <linux-arm-kernel@...ts.infradead.org>,
        Linux Memory Management List <linux-mm@...ck.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Dhaval Giani <dhaval.giani@...cle.com>,
        Mike Rapoport <rppt@...ux.ibm.com>
Subject: Re: [PATCH] mm, kasan: don't poison boot memory



On 2/22/2021 11:13 AM, David Hildenbrand wrote:
> On 22.02.21 16:13, George Kennedy wrote:
>>
>>
>> On 2/22/2021 4:52 AM, David Hildenbrand wrote:
>>> On 20.02.21 00:04, George Kennedy wrote:
>>>>
>>>>
>>>> On 2/19/2021 11:45 AM, George Kennedy wrote:
>>>>>
>>>>>
>>>>> On 2/18/2021 7:09 PM, Andrey Konovalov wrote:
>>>>>> On Fri, Feb 19, 2021 at 1:06 AM George Kennedy
>>>>>> <george.kennedy@...cle.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2/18/2021 3:55 AM, David Hildenbrand wrote:
>>>>>>>> On 17.02.21 21:56, Andrey Konovalov wrote:
>>>>>>>>> During boot, all non-reserved memblock memory is exposed to the
>>>>>>>>> buddy
>>>>>>>>> allocator. Poisoning all that memory with KASAN lengthens boot
>>>>>>>>> time,
>>>>>>>>> especially on systems with large amount of RAM. This patch makes
>>>>>>>>> page_alloc to not call kasan_free_pages() on all new memory.
>>>>>>>>>
>>>>>>>>> __free_pages_core() is used when exposing fresh memory during
>>>>>>>>> system
>>>>>>>>> boot and when onlining memory during hotplug. This patch adds 
>>>>>>>>> a new
>>>>>>>>> FPI_SKIP_KASAN_POISON flag and passes it to __free_pages_ok()
>>>>>>>>> through
>>>>>>>>> free_pages_prepare() from __free_pages_core().
>>>>>>>>>
>>>>>>>>> This has little impact on KASAN memory tracking.
>>>>>>>>>
>>>>>>>>> Assuming that there are no references to newly exposed pages
>>>>>>>>> before they
>>>>>>>>> are ever allocated, there won't be any intended (but buggy)
>>>>>>>>> accesses to
>>>>>>>>> that memory that KASAN would normally detect.
>>>>>>>>>
>>>>>>>>> However, with this patch, KASAN stops detecting wild and large
>>>>>>>>> out-of-bounds accesses that happen to land on a fresh memory page
>>>>>>>>> that
>>>>>>>>> was never allocated. This is taken as an acceptable trade-off.
>>>>>>>>>
>>>>>>>>> All memory allocated normally when the boot is over keeps getting
>>>>>>>>> poisoned as usual.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Andrey Konovalov <andreyknvl@...gle.com>
>>>>>>>>> Change-Id: Iae6b1e4bb8216955ffc14af255a7eaaa6f35324d
>>>>>>>> Not sure this is the right thing to do, see
>>>>>>>>
>>>>>>>> https://lkml.kernel.org/r/bcf8925d-0949-3fe1-baa8-cc536c529860@oracle.com 
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Reversing the order in which memory gets allocated + used during
>>>>>>>> boot
>>>>>>>> (in a patch by me) might have revealed an invalid memory access
>>>>>>>> during
>>>>>>>> boot.
>>>>>>>>
>>>>>>>> I suspect that that issue would no longer get detected with your
>>>>>>>> patch, as the invalid memory access would simply not get detected.
>>>>>>>> Now, I cannot prove that :)
>>>>>>> Since David's patch we're having trouble with the iBFT ACPI table,
>>>>>>> which
>>>>>>> is mapped in via kmap() - see acpi_map() in "drivers/acpi/osl.c".
>>>>>>> KASAN
>>>>>>> detects that it is being used after free when ibft_init() accesses
>>>>>>> the
>>>>>>> iBFT table, but as of yet we can't find where it gets freed (we've
>>>>>>> instrumented calls to kunmap()).
>>>>>> Maybe it doesn't get freed, but what you see is a wild or a large
>>>>>> out-of-bounds access. Since KASAN marks all memory as freed 
>>>>>> during the
>>>>>> memblock->page_alloc transition, such bugs can manifest as
>>>>>> use-after-frees.
>>>>>
>>>>> It gets freed and re-used. By the time the iBFT table is accessed by
>>>>> ibft_init() the page has been over-written.
>>>>>
>>>>> Setting page flags like the following before the call to kmap()
>>>>> prevents the iBFT table page from being freed:
>>>>
>>>> Cleaned up version:
>>>>
>>>> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
>>>> index 0418feb..8f0a8e7 100644
>>>> --- a/drivers/acpi/osl.c
>>>> +++ b/drivers/acpi/osl.c
>>>> @@ -287,9 +287,12 @@ static void __iomem 
>>>> *acpi_map(acpi_physical_address
>>>> pg_off, unsigned long pg_sz)
>>>>
>>>>         pfn = pg_off >> PAGE_SHIFT;
>>>>         if (should_use_kmap(pfn)) {
>>>> +        struct page *page = pfn_to_page(pfn);
>>>> +
>>>>             if (pg_sz > PAGE_SIZE)
>>>>                 return NULL;
>>>> -        return (void __iomem __force *)kmap(pfn_to_page(pfn));
>>>> +        SetPageReserved(page);
>>>> +        return (void __iomem __force *)kmap(page);
>>>>         } else
>>>>             return acpi_os_ioremap(pg_off, pg_sz);
>>>>     }
>>>> @@ -299,9 +302,12 @@ static void acpi_unmap(acpi_physical_address
>>>> pg_off, void __iomem *vaddr)
>>>>         unsigned long pfn;
>>>>
>>>>         pfn = pg_off >> PAGE_SHIFT;
>>>> -    if (should_use_kmap(pfn))
>>>> -        kunmap(pfn_to_page(pfn));
>>>> -    else
>>>> +    if (should_use_kmap(pfn)) {
>>>> +        struct page *page = pfn_to_page(pfn);
>>>> +
>>>> +        ClearPageReserved(page);
>>>> +        kunmap(page);
>>>> +    } else
>>>>             iounmap(vaddr);
>>>>     }
>>>>
>>>> David, the above works, but wondering why it is now necessary. 
>>>> kunmap()
>>>> is not hit. What other ways could a page mapped via kmap() be 
>>>> unmapped?
>>>>
>>>
>>> Let me look into the code ... I have little experience with ACPI
>>> details, so bear with me.
>>>
>>> I assume that acpi_map()/acpi_unmap() map some firmware blob that is
>>> provided via firmware/bios/... to us.
>>>
>>> should_use_kmap() tells us whether
>>> a) we have a "struct page" and should kmap() that one
>>> b) we don't have a "struct page" and should ioremap.
>>>
>>> As it is a blob, the firmware should always reserve that memory region
>>> via memblock (e.g., memblock_reserve()), such that we either
>>> 1) don't create a memmap ("struct page") at all (-> case b) )
>>> 2) if we have to create a memmap, we mark the page PG_reserved and
>>>     *never* expose it to the buddy (-> case a) )
>>>
>>>
>>> Are you telling me that in this case we might have a memmap for the HW
>>> blob that is *not* PG_reserved? In that case it most probably got
>>> exposed to the buddy where it can happily get allocated/freed.
>>>
>>> The latent BUG would be that that blob gets exposed to the system like
>>> ordinary RAM, and not reserved via memblock early during boot.
>>> Assuming that blob has a low physical address, with my patch it will
>>> get allocated/used a lot earlier - which would mean we trigger this
>>> latent BUG now more easily.
>>>
>>> There have been similar latent BUGs on ARM boards that my patch
>>> discovered where special RAM regions did not get marked as reserved
>>> via the device tree properly.
>>>
>>> Now, this is just a wild guess :) Can you dump the page when mapping
>>> (before PageReserved()) and when unmapping, to see what the state of
>>> that memmap is?
>>
>> Thank you David for the explanation and your help on this,
>>
>> dump_page() before PageReserved and before kmap() in the above patch:
>>
>> [    1.116480] ACPI: Core revision 20201113
>> [    1.117628] XXX acpi_map: about to call kmap()...
>> [    1.118561] page:ffffea0002f914c0 refcount:0 mapcount:0
>> mapping:0000000000000000 index:0x0 pfn:0xbe453
>> [    1.120381] flags: 0xfffffc0000000()
>> [    1.121116] raw: 000fffffc0000000 ffffea0002f914c8 ffffea0002f914c8
>> 0000000000000000
>> [    1.122638] raw: 0000000000000000 0000000000000000 00000000ffffffff
>> 0000000000000000
>> [    1.124146] page dumped because: acpi_map pre SetPageReserved
>>
>> I also added dump_page() before unmapping, but it is not hit. The
>> following for the same pfn now shows up I believe as a result of setting
>> PageReserved:
>>
>> [   28.098208] BUG: Bad page state in process modprobe pfn:be453
>> [   28.098394] page:ffffea0002f914c0 refcount:0 mapcount:0
>> mapping:0000000000000000 index:0x1 pfn:0xbe453
>> [   28.098394] flags: 0xfffffc0001000(reserved)
>> [   28.098394] raw: 000fffffc0001000 dead000000000100 dead000000000122
>> 0000000000000000
>> [   28.098394] raw: 0000000000000001 0000000000000000 00000000ffffffff
>> 0000000000000000
>> [   28.098394] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag(s) set
>> [   28.098394] page_owner info is not present (never set?)
>> [   28.098394] Modules linked in:
>> [   28.098394] CPU: 2 PID: 204 Comm: modprobe Not tainted 
>> 5.11.0-3dbd5e3 #66
>> [   28.098394] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>> BIOS 0.0.0 02/06/2015
>> [   28.098394] Call Trace:
>> [   28.098394]  dump_stack+0xdb/0x120
>> [   28.098394]  bad_page.cold.108+0xc6/0xcb
>> [   28.098394]  check_new_page_bad+0x47/0xa0
>> [   28.098394]  get_page_from_freelist+0x30cd/0x5730
>> [   28.098394]  ? __isolate_free_page+0x4f0/0x4f0
>> [   28.098394]  ? init_object+0x7e/0x90
>> [   28.098394]  __alloc_pages_nodemask+0x2d8/0x650
>> [   28.098394]  ? write_comp_data+0x2f/0x90
>> [   28.098394]  ? __alloc_pages_slowpath.constprop.103+0x2110/0x2110
>> [   28.098394]  ? __sanitizer_cov_trace_pc+0x21/0x50
>> [   28.098394]  alloc_pages_vma+0xe2/0x560
>> [   28.098394]  do_fault+0x194/0x12c0
>> [   28.098394]  ? write_comp_data+0x2f/0x90
>> [   28.098394]  __handle_mm_fault+0x1650/0x26c0
>> [   28.098394]  ? copy_page_range+0x1350/0x1350
>> [   28.098394]  ? write_comp_data+0x2f/0x90
>> [   28.098394]  ? write_comp_data+0x2f/0x90
>> [   28.098394]  handle_mm_fault+0x1f9/0x810
>> [   28.098394]  ? write_comp_data+0x2f/0x90
>> [   28.098394]  do_user_addr_fault+0x6f7/0xca0
>> [   28.098394]  exc_page_fault+0xaf/0x1a0
>> [   28.098394]  asm_exc_page_fault+0x1e/0x30
>> [   28.098394] RIP: 0010:__clear_user+0x30/0x60
>
> I think the PAGE_FLAGS_CHECK_AT_PREP check in this instance means that 
> someone is trying to allocate that page with the PG_reserved bit set. 
> This means that the page actually was exposed to the buddy.
>
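The two dump_page() outputs above can be compared directly. The following is a user-space sketch, not kernel code, that XORs the two raw flags words copied from the dumps to show which bit changed. The helper names changed_flag_bits() and lowest_set_bit() are made up for illustration.

```c
#include <stdint.h>

/* Return the bits that differ between two page->flags words. */
static uint64_t changed_flag_bits(uint64_t before, uint64_t after)
{
	return before ^ after;
}

/* Return the index of the lowest set bit, or -1 if no bit is set. */
static int lowest_set_bit(uint64_t word)
{
	for (int bit = 0; bit < 64; bit++)
		if (word & (UINT64_C(1) << bit))
			return bit;
	return -1;
}
```

With 0x000fffffc0000000 (first dump) and 0x000fffffc0001000 (second dump) the only difference is 0x1000, i.e. bit 12, which the second dump decodes as "reserved". That matches the SetPageReserved() added by the debug patch.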
> However, when you SetPageReserved(), I don't think that PG_buddy is 
> set and the refcount is 0. That could indicate that the page is on the 
> buddy PCP list. Could be that it is getting reused a couple of times.
>
> The PFN 0xbe453 looks a little strange, though. Do we expect ACPI 
> tables close to 3 GiB ? No idea. Could it be that you are trying to 
> map a wrong table? Just a guess.
>
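As a quick check of the "close to 3 GiB" estimate, the PFN can be converted to a physical address. This is a user-space sketch, not kernel code. It assumes 4 KiB pages (a page shift of 12, as on x86), and the names MODEL_PAGE_SHIFT and model_pfn_to_phys() are made up for illustration.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, as on the x86 QEMU guest in the log. */
#define MODEL_PAGE_SHIFT 12

/* Convert a page frame number to the physical address of its first byte. */
static uint64_t model_pfn_to_phys(uint64_t pfn)
{
	return pfn << MODEL_PAGE_SHIFT;
}
```

For pfn 0xbe453 this gives 0xbe453000, which is roughly 2.97 GiB.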
>>
>> What would be  the correct way to reserve the page so that the above
>> would not be hit?
>
> I would have assumed that if this is a binary blob, that someone 
> (which I think would be acpi code) reserved via memblock_reserve() 
> early during boot.
>
> E.g., see drivers/acpi/tables.c:acpi_table_upgrade()->memblock_reserve().

acpi_table_upgrade() gets called, but bails out before 
memblock_reserve() is called. Thus, it appears no pages are getting 
reserved.

void __init acpi_table_upgrade(void)
{
        void *data;
        size_t size;
        int sig, no, table_nr = 0, total_offset = 0;
        long offset = 0;
        struct acpi_table_header *table;
        char cpio_path[32] = "kernel/firmware/acpi/";
        struct cpio_data file;

        if (IS_ENABLED(CONFIG_ACPI_TABLE_OVERRIDE_VIA_BUILTIN_INITRD)) {
                data = __initramfs_start;
                size = __initramfs_size;
        } else {
                data = (void *)initrd_start;
                size = initrd_end - initrd_start;
        }

        if (data == NULL || size == 0)
                return;

        for (no = 0; no < NR_ACPI_INITRD_TABLES; no++) {
                file = find_cpio_data(cpio_path, data, size, &offset);
                if (!file.data)
                        break;
...
                all_tables_size += table->length;
                acpi_initrd_files[table_nr].data = file.data;
                acpi_initrd_files[table_nr].size = file.size;
                table_nr++;
        }
        if (table_nr == 0)
                return;                         /* <-- bails out here */
"drivers/acpi/tables.c"

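The early exit can be illustrated with a small user-space model. This is only a sketch of the control flow quoted above, not the kernel function. model_memblock_reserve() and model_acpi_table_upgrade() are hypothetical stand-ins, and the cpio search is reduced to a count of tables found.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for memblock_reserve(): only records that it ran. */
static bool reserve_called;

static void model_memblock_reserve(void)
{
	reserve_called = true;
}

/*
 * Simplified model of the control flow in acpi_table_upgrade().
 * initrd_size:  size of the initrd blob (0 if there is none)
 * tables_found: number of ACPI tables found in the cpio archive
 * Returns true only if the reserve step was reached.
 */
static bool model_acpi_table_upgrade(size_t initrd_size, int tables_found)
{
	int table_nr = 0;

	reserve_called = false;

	if (initrd_size == 0)
		return false;

	/* Each table found in the archive bumps table_nr. */
	for (int no = 0; no < tables_found; no++)
		table_nr++;

	if (table_nr == 0)
		return false;	/* the exit path reported in this mail */

	model_memblock_reserve();
	return reserve_called;
}
```

With an initrd present but no tables found, the model returns before the reserve step, which is the behavior seen here.
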
George