lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4858258b-8fb9-b060-8724-25c89ba2ce0e@redhat.com>
Date:   Wed, 20 Oct 2021 09:05:10 +0200
From:   David Hildenbrand <david@...hat.com>
To:     Jianyong Wu <jianyong.wu@....com>, akpm@...ux-foundation.org,
        mhiramat@...nel.org, peterz@...radead.org
Cc:     rostedt@...dmis.org, vbabka@...e.cz, linux-kernel@...r.kernel.org,
        anshuman.khandual@....com, justin.he@....com, nd@....com
Subject: Re: [PATCH v1] init: avoid race condition of update page table in
 kernel init

On 20.10.21 04:07, Jianyong Wu wrote:
> Race condition of page table update can happen in kernel_init as
> both of memory hotplug module init and the following mark_rodata_ro can
> update page table. The function excute flow chart is:
> 
> -------------------------
> kernel_init
>   kernel_init_freeable
>     ...
>       do_initcall
>         ...
>           module_init [A]

Nit: virtio-mem adds memory via a workqueue, not directly during
module_init.

> 
>   ...
>   mark_readonly
>     mark_rodata_ro [B]
> -------------------------
> [A] can contains memory hotplug init therefore both [A] and [B] can
> update page table at the same time that may lead to race. Here we
> introduce memory hotplug lock to guard mark_rodata_ro to avoid the race
> condition.
> 
> I catch the related error when test virtio-mem (a new memory hotplug
> driver) on arm64 and may be a potential bug for other arches.

Thanks for reporting, we should be able to trigger something similar
using ACPI, when hotplugging DIMMs just at the wrong point in time.

> 
> How to reproduce on arm64:
> (1) prepare a kernel with virtio-mem enabled on arm64
> (2) start a VM using Cloud Hypervisor[1] using the kernel above
> (3) hotplug memory, 20G in my case, with virtio-mem
> (4) reboot or load new kernel using kexec
> 
> Test for server times, you may find the error below:
> 
> [    1.131039] Unable to handle kernel paging request at virtual address fffffbfffda3b140
> [    1.134504] Mem abort info:
> [    1.135722]   ESR = 0x96000007
> [    1.136991]   EC = 0x25: DABT (current EL), IL = 32 bits
> [    1.139189]   SET = 0, FnV = 0
> [    1.140467]   EA = 0, S1PTW = 0
> [    1.141755]   FSC = 0x07: level 3 translation fault
> [    1.143787] Data abort info:
> [    1.144976]   ISV = 0, ISS = 0x00000007
> [    1.146554]   CM = 0, WnR = 0
> [    1.147817] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000426f2000
> [    1.150551] [fffffbfffda3b140] pgd=0000000042ffd003, p4d=0000000042ffd003, pud=0000000042ffe003, pmd=0000000042fff003, pte=0000000000000000
> [    1.155728] Internal error: Oops: 96000007 [#1] SMP
> [    1.157724] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G         C        5.15.0-rc3+ #100
> [    1.161002] Hardware name: linux,dummy-virt (DT)
> [    1.162939] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [    1.165825] pc : alloc_init_pud+0x38c/0x550
> [    1.167610] lr : alloc_init_pud+0x394/0x550
> [    1.169358] sp : ffff80001001bd10
> ......
> [    1.200527] Call trace:
> [    1.201583]  alloc_init_pud+0x38c/0x550
> [    1.203218]  __create_pgd_mapping+0x94/0xe0
> [    1.204983]  update_mapping_prot+0x50/0xd8
> [    1.206730]  mark_rodata_ro+0x50/0x58
> [    1.208281]  kernel_init+0x3c/0x120
> [    1.209760]  ret_from_fork+0x10/0x20
> [    1.211298] Code: eb15003f 54000061 d5033a9f d5033fdf (f94000a1)
> [    1.213856] ---[ end trace 59473413ffe3f52d ]---
> [    1.215850] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> 
> [1] https://github.com/cloud-hypervisor/cloud-hypervisor
> 
> Suggested-by: Anshuman Khandual <anshuman.khandual@....com>
> Signed-off-by: Jianyong Wu <jianyong.wu@....com>
> ---
>  init/main.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/init/main.c b/init/main.c
> index 81a79a77db46..290c9882ba9e 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -1471,7 +1471,9 @@ static void mark_readonly(void)
>  		 * insecure pages which are W+X.
>  		 */
>  		rcu_barrier();
> +		get_online_mems();
>  		mark_rodata_ro();
> +		put_online_mems();

I feel like this should be handled accordingly in the arch code instead.
The arch code has to be able to deal with concurrent direct map
operations (e.g., concurrent secretmem updates, concurrent memory hotadd
/ hot removal).

I remember x86-64 serializes page table init using
&init_mm.page_table_lock and page table updates using the cpa_lock (see
arch/x86/mm/pat/set_memory.c).

The question would be: in which cases would we touch the same page
tables when adding memory while concurrently splitting such a page
table. On x86-64, IIRC, all that could happen is that we split a huge
direct mapping and replace it via a page table holding 4k mappings. This
cannot really conflict with memory hotadd, which works in at least 2 MiB
(sub-section) granularity. Two ranges couldn't silently overlap and
trigger such a race.

So I don't think something like that applies on x86-64: I've run endless
tests with CONFIG_VIRTIO_MEM=y over the years (instead of "m") and never
spotted something similar.

But why does it matter on arm64? Can you describe how the exact race
triggers?

-- 
Thanks,

David / dhildenb

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ