linux-kernel - Re: [PATCH v1] init: avoid race condition of update page table in kernel init

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <4858258b-8fb9-b060-8724-25c89ba2ce0e@redhat.com>
Date:   Wed, 20 Oct 2021 09:05:10 +0200
From:   David Hildenbrand <david@...hat.com>
To:     Jianyong Wu <jianyong.wu@....com>, akpm@...ux-foundation.org,
        mhiramat@...nel.org, peterz@...radead.org
Cc:     rostedt@...dmis.org, vbabka@...e.cz, linux-kernel@...r.kernel.org,
        anshuman.khandual@....com, justin.he@....com, nd@....com
Subject: Re: [PATCH v1] init: avoid race condition of update page table in
 kernel init

On 20.10.21 04:07, Jianyong Wu wrote:
> Race condition of page table update can happen in kernel_init as
> both of memory hotplug module init and the following mark_rodata_ro can
> update page table. The function excute flow chart is:
> 
> -------------------------
> kernel_init
>   kernel_init_freeable
>     ...
>       do_initcall
>         ...
>           module_init [A]

Nit: virtio-mem adds memory via a workqueue, not directly during
module_init.

> 
>   ...
>   mark_readonly
>     mark_rodata_ro [B]
> -------------------------
> [A] can contains memory hotplug init therefore both [A] and [B] can
> update page table at the same time that may lead to race. Here we
> introduce memory hotplug lock to guard mark_rodata_ro to avoid the race
> condition.
> 
> I catch the related error when test virtio-mem (a new memory hotplug
> driver) on arm64 and may be a potential bug for other arches.

Thanks for reporting, we should be able to trigger something similar
using ACPI, when hotplugging DIMMs just at the wrong point in time.

> 
> How to reproduce on arm64:
> (1) prepare a kernel with virtio-mem enabled on arm64
> (2) start a VM using Cloud Hypervisor[1] using the kernel above
> (3) hotplug memory, 20G in my case, with virtio-mem
> (4) reboot or load new kernel using kexec
> 
> Test for server times, you may find the error below:
> 
> [    1.131039] Unable to handle kernel paging request at virtual address fffffbfffda3b140
> [    1.134504] Mem abort info:
> [    1.135722]   ESR = 0x96000007
> [    1.136991]   EC = 0x25: DABT (current EL), IL = 32 bits
> [    1.139189]   SET = 0, FnV = 0
> [    1.140467]   EA = 0, S1PTW = 0
> [    1.141755]   FSC = 0x07: level 3 translation fault
> [    1.143787] Data abort info:
> [    1.144976]   ISV = 0, ISS = 0x00000007
> [    1.146554]   CM = 0, WnR = 0
> [    1.147817] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000426f2000
> [    1.150551] [fffffbfffda3b140] pgd=0000000042ffd003, p4d=0000000042ffd003, pud=0000000042ffe003, pmd=0000000042fff003, pte=0000000000000000
> [    1.155728] Internal error: Oops: 96000007 [#1] SMP
> [    1.157724] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G         C        5.15.0-rc3+ #100
> [    1.161002] Hardware name: linux,dummy-virt (DT)
> [    1.162939] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [    1.165825] pc : alloc_init_pud+0x38c/0x550
> [    1.167610] lr : alloc_init_pud+0x394/0x550
> [    1.169358] sp : ffff80001001bd10
> ......
> [    1.200527] Call trace:
> [    1.201583]  alloc_init_pud+0x38c/0x550
> [    1.203218]  __create_pgd_mapping+0x94/0xe0
> [    1.204983]  update_mapping_prot+0x50/0xd8
> [    1.206730]  mark_rodata_ro+0x50/0x58
> [    1.208281]  kernel_init+0x3c/0x120
> [    1.209760]  ret_from_fork+0x10/0x20
> [    1.211298] Code: eb15003f 54000061 d5033a9f d5033fdf (f94000a1)
> [    1.213856] ---[ end trace 59473413ffe3f52d ]---
> [    1.215850] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> 
> [1] https://github.com/cloud-hypervisor/cloud-hypervisor
> 
> Suggested-by: Anshuman Khandual <anshuman.khandual@....com>
> Signed-off-by: Jianyong Wu <jianyong.wu@....com>
> ---
>  init/main.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/init/main.c b/init/main.c
> index 81a79a77db46..290c9882ba9e 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -1471,7 +1471,9 @@ static void mark_readonly(void)
>  		 * insecure pages which are W+X.
>  		 */
>  		rcu_barrier();
> +		get_online_mems();
>  		mark_rodata_ro();
> +		put_online_mems();

I feel like this should be handled accordingly in the arch code instead.
The arch code has to be able to deal with concurrent direct map
operations (e.g., concurrent secretmem updates, concurrent memory hotadd
/ hot removal).

I remember x86-64 serializes page table init using
&init_mm.page_table_lock and page table updates using the cpa_lock (see
arch/x86/mm/pat/set_memory.c).

The question would be: in which cases would we touch the same page
tables when adding memory while concurrently splitting such a page
table. On x86-64, IIRC, all that could happen is that we split a huge
direct mapping and replace it via a page table holding 4k mappings. This
cannot really conflict with memory hotadd, which works in at least 2 MiB
(sub-section) granularity. Two ranges couldn't silently overlap and
trigger such a race.

So I don't think something like that applies on x86-64: I've run endless
tests with CONFIG_VIRTIO_MEM=y over the years (instead of "m") and never
spotted something similar.

But why does it matter on arm64? Can you describe how the exact race
triggers?

-- 
Thanks,

David / dhildenb