Message-ID: <AM9PR08MB7276FEF0AE524A9728F63A4CF4BE9@AM9PR08MB7276.eurprd08.prod.outlook.com>
Date: Wed, 20 Oct 2021 08:17:51 +0000
From: Jianyong Wu <Jianyong.Wu@....com>
To: David Hildenbrand <david@...hat.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"mhiramat@...nel.org" <mhiramat@...nel.org>,
"peterz@...radead.org" <peterz@...radead.org>
CC: "rostedt@...dmis.org" <rostedt@...dmis.org>,
"vbabka@...e.cz" <vbabka@...e.cz>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Anshuman Khandual <Anshuman.Khandual@....com>,
Justin He <Justin.He@....com>, nd <nd@....com>
Subject: RE: [PATCH v1] init: avoid race condition of update page table in
kernel init
Hello David,
> -----Original Message-----
> From: David Hildenbrand <david@...hat.com>
> Sent: Wednesday, October 20, 2021 3:05 PM
> To: Jianyong Wu <Jianyong.Wu@....com>; akpm@...ux-foundation.org;
> mhiramat@...nel.org; peterz@...radead.org
> Cc: rostedt@...dmis.org; vbabka@...e.cz; linux-kernel@...r.kernel.org;
> Anshuman Khandual <Anshuman.Khandual@....com>; Justin He
> <Justin.He@....com>; nd <nd@....com>
> Subject: Re: [PATCH v1] init: avoid race condition of update page table in
> kernel init
>
> On 20.10.21 04:07, Jianyong Wu wrote:
> > A race condition on page table updates can happen in kernel_init, as
> > both memory hotplug module init and the following mark_rodata_ro can
> > update the page table. The execution flow is:
> >
> > -------------------------
> > kernel_init
> > kernel_init_freeable
> > ...
> > do_initcall
> > ...
> > module_init [A]
>
> Nit: virtio-mem adds memory via a workqueue, not directly during
> module_init.
Yes, maybe it's misleading. What I mean is that during driver initialization, e.g. for virtio-mem, the workqueue will be triggered before mark_rodata_ro.
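To make the timing concrete, here is a minimal self-contained sketch of that pattern (the demo_* names are hypothetical, not the real virtio-mem probe path; only the "module_init queues work" shape matters):

#include <linux/module.h>
#include <linux/workqueue.h>

/* Runs asynchronously: kernel_init() may already have moved on to
 * mark_rodata_ro() by the time this executes. */
static void demo_hotplug_fn(struct work_struct *work)
{
	/* the real driver ends up in add_memory() -> arch_add_memory(),
	 * which creates new kernel page table entries */
}

static DECLARE_WORK(demo_hotplug_work, demo_hotplug_fn);

static int __init demo_init(void)
{
	/* module_init() returns right away; the race window opens here */
	schedule_work(&demo_hotplug_work);
	return 0;
}
module_init(demo_init);

MODULE_LICENSE("GPL");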
>
> >
> > ...
> > mark_readonly
> > mark_rodata_ro [B]
> > -------------------------
> > [A] can contain memory hotplug init, therefore both [A] and [B] can
> > update the page table at the same time, which may lead to a race. Here
> > we introduce the memory hotplug lock to guard mark_rodata_ro and avoid
> > the race condition.
> >
> > I caught the related error when testing virtio-mem (a new memory
> > hotplug driver) on arm64; it may be a potential bug for other arches.
>
> Thanks for reporting, we should be able to trigger something similar using
> ACPI, when hotplugging DIMMs just at the wrong point in time.
>
I think so; ACPI-based memory hotplug may have the same issue. I'll test ACPI-based DIMM hotplug for this case.
> >
> > How to reproduce on arm64:
> > (1) prepare a kernel with virtio-mem enabled on arm64
> > (2) start a VM with Cloud Hypervisor[1] using the kernel above
> > (3) hotplug memory, 20G in my case, with virtio-mem
> > (4) reboot or load new kernel using kexec
> >
> > Repeat the test several times and you may hit the error below:
> >
> > [ 1.131039] Unable to handle kernel paging request at virtual address
> fffffbfffda3b140
> > [ 1.134504] Mem abort info:
> > [ 1.135722] ESR = 0x96000007
> > [ 1.136991] EC = 0x25: DABT (current EL), IL = 32 bits
> > [ 1.139189] SET = 0, FnV = 0
> > [ 1.140467] EA = 0, S1PTW = 0
> > [ 1.141755] FSC = 0x07: level 3 translation fault
> > [ 1.143787] Data abort info:
> > [ 1.144976] ISV = 0, ISS = 0x00000007
> > [ 1.146554] CM = 0, WnR = 0
> > [ 1.147817] swapper pgtable: 4k pages, 48-bit VAs,
> pgdp=00000000426f2000
> > [ 1.150551] [fffffbfffda3b140] pgd=0000000042ffd003,
> p4d=0000000042ffd003, pud=0000000042ffe003, pmd=0000000042fff003,
> pte=0000000000000000
> > [ 1.155728] Internal error: Oops: 96000007 [#1] SMP
> > [ 1.157724] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G C 5.15.0-
> rc3+ #100
> > [ 1.161002] Hardware name: linux,dummy-virt (DT)
> > [ 1.162939] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
> BTYPE=--)
> > [ 1.165825] pc : alloc_init_pud+0x38c/0x550
> > [ 1.167610] lr : alloc_init_pud+0x394/0x550
> > [ 1.169358] sp : ffff80001001bd10
> > ......
> > [ 1.200527] Call trace:
> > [ 1.201583] alloc_init_pud+0x38c/0x550
> > [ 1.203218] __create_pgd_mapping+0x94/0xe0
> > [ 1.204983] update_mapping_prot+0x50/0xd8
> > [ 1.206730] mark_rodata_ro+0x50/0x58
> > [ 1.208281] kernel_init+0x3c/0x120
> > [ 1.209760] ret_from_fork+0x10/0x20
> > [ 1.211298] Code: eb15003f 54000061 d5033a9f d5033fdf (f94000a1)
> > [ 1.213856] ---[ end trace 59473413ffe3f52d ]---
> > [ 1.215850] Kernel panic - not syncing: Attempted to kill init!
> exitcode=0x0000000b
> >
> > [1] https://github.com/cloud-hypervisor/cloud-hypervisor
> >
> > Suggested-by: Anshuman Khandual <anshuman.khandual@....com>
> > Signed-off-by: Jianyong Wu <jianyong.wu@....com>
> > ---
> > init/main.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/init/main.c b/init/main.c index
> > 81a79a77db46..290c9882ba9e 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -1471,7 +1471,9 @@ static void mark_readonly(void)
> > * insecure pages which are W+X.
> > */
> > rcu_barrier();
> > + get_online_mems();
> > mark_rodata_ro();
> > + put_online_mems();
>
> I feel like this should be handled accordingly in the arch code instead.
> The arch code has to be able to deal with concurrent direct map operations
> (e.g., concurrent secretmem updates, concurrent memory hotadd / hot
> removal).
>
IMO, the race on page table updates may happen here and affect all arches once the virtio-mem workqueue has been triggered.
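To spell out the serialization the patch relies on (my reading of mm/memory_hotplug.c, simplified):

	/*
	 * get_online_mems()/put_online_mems() take mem_hotplug_lock for
	 * read; add_memory()/remove_memory() take it for write. So the
	 * guarded section cannot overlap an in-flight hotplug operation.
	 */
	get_online_mems();	/* blocks concurrent memory hotplug */
	mark_rodata_ro();	/* no concurrent arch_add_memory() now */
	put_online_mems();	/* hotplug may proceed again */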
> I remember x86-64 serializes page table init using &init_mm.page_table_lock
> and page table updates using the cpa_lock (see
> arch/x86/mm/pat/set_memory.c).
>
> The question would be: in which cases would we touch the same page tables
> when adding memory while concurrently splitting such a page table. On x86-
> 64, IIRC, all that could happen is that we split a huge direct mapping and
> replace it via a page table holding 4k mappings. This cannot really conflict
> with memory hotadd, which works in at least 2 MiB
> (sub-section) granularity. Two ranges couldn't silently overlap and trigger
> such a race.
>
> So I don't think something like that applies on x86-64: I've run endless tests
> with CONFIG_VIRTIO_MEM=y over the years (instead of "m") and never
> spotted something similar.
>
> But why does it matter on arm64? Can you describe how the exact race
> triggers?
I don't know much about how x86 handles the memory map, so let me show how the race happens on arm64.
When the virtio-mem workqueue is triggered, arch_add_memory will be called, where the related page tables are created. The call chain is arch_add_memory->__create_pgd_mapping->alloc_init_pud. As the memory add may take several seconds, it can run concurrently with mark_rodata_ro, which creates page tables as well.
The race can occur in alloc_init_pud. See below:
/***************************************************************************/
  virtio-mem workqueue thread                |  mark_rodata_ro thread
  -------------------------------------------+-------------------------------
  alloc_init_pud(...)                        |
  {                                          |
      ...                                    |
      pudp = pud_set_fixmap_offset(p4dp,     |
                      addr);  // set fixmap  |
      do {                                   |
          pud_t old_pud = READ_ONCE(*pudp);  |
          ...                                |  pudp = pud_set_fixmap_offset(
      } while (pudp++, addr = next,          |      p4dp, addr); // set fixmap
               addr != end);                 |
      pud_clear_fixmap(); // clear fixmap    |  do {
  }                                          |      pud_t old_pud =
                                             |          READ_ONCE(*pudp);
                                             |          // CRASH
/***************************************************************************/
When creating the direct kernel memory map, the fixmap, which is global, is used, and it must not be used concurrently. Once the race happens, a crash may occur; that's why we can see in the kernel dump that the *pte* value is *0*.
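As a thought experiment, an arch-side alternative to the init/main.c change would be to serialize the fixmap window itself. A minimal sketch, assuming a new fixmap_lock (the lock is my assumption; the helper names are the existing ones from arch/arm64/mm/mmu.c):

/* Assumption: a new lock dedicated to the page-table fixmap slots. */
static DEFINE_SPINLOCK(fixmap_lock);

	/* inside alloc_init_pud(), around the existing fixmap window: */
	spin_lock(&fixmap_lock);
	pudp = pud_set_fixmap_offset(p4dp, addr); /* map pud page via fixmap */
	do {
		pud_t old_pud = READ_ONCE(*pudp);
		/* ... populate or split entries as before ... */
	} while (pudp++, addr = next, addr != end);
	pud_clear_fixmap();                       /* unmap the fixmap slot */
	spin_unlock(&fixmap_lock);                /* others may map it now */

That would keep the serialization local to the code that actually owns the shared fixmap, instead of widening the memory hotplug lock's scope in init/main.c.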
Thanks
Jianyong
>
> --
> Thanks,
>
> David / dhildenb