lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <33372b6e-853f-e7d2-9785-167e2218947c@gmail.com>
Date:   Tue, 27 Jun 2023 23:13:11 +0800
From:   Song Shuai <suagrfillet@...il.com>
To:     Alexandre Ghiti <alexghiti@...osinc.com>
Cc:     paul.walmsley@...ive.com, palmer@...belt.com,
        aou@...s.berkeley.edu, robh+dt@...nel.org, frowand.list@...il.com,
        ajones@...tanamicro.com, mpe@...erman.id.au, arnd@...db.de,
        rppt@...nel.org, samuel@...lland.org, panqinglin2020@...as.ac.cn,
        conor.dooley@...rochip.com, anup@...infault.org,
        xianting.tian@...ux.alibaba.com, anshuman.khandual@....com,
        heiko@...ech.de, linux-riscv@...ts.infradead.org,
        linux-kernel@...r.kernel.org, devicetree@...r.kernel.org
Subject: Re: [PATCH V1 0/3] Revert huge-paged linear mapping and its related
 fixups

Hi Alex,

在 2023/6/27 19:47, Alexandre Ghiti 写道:
> On Sun, Jun 25, 2023 at 10:36 PM Alexandre Ghiti <alexghiti@...osinc.com> wrote:
>>
>> Hi Song,
>>
>> On Sun, Jun 25, 2023 at 4:10 PM Song Shuai <songshuaishuai@...ylab.org> wrote:
>>>
>>> We have encountered these two issues about huge-paged linear mapping since v6.4-rc1:
>>>
>>> 1. Bug report: kernel paniced when system hibernates[1]
>>>
>>> OpenSbi [v0.8,v1.3) set the PMP regions as !no-map, and the commit 3335068f8721
>>> ("riscv: Use PUD/P4D/PGD pages for the linear mapping") mapped them in linear mapping.
>>> The hibernation process attempted to save/restore these mapped regions resulting in access fault.
>>>
>>> This issue was temporarily fixed by commit ed309ce52218 ("RISC-V: mark hibernation as nonportable").
>>> But as Alex's RFC and Rob's response stats in another thread [2] ,
>>> "Hibernation is only one case. Speculative accesses could also occur."
>>> So this fixing commit seems not the perfect answer to this issue.
>>>
>>>
>>> 2. Bug report: kernel paniced while booting (with UEFI )[3]
>>>
>>> During the booting with UEFI, UEFI Memory Mapping overwrote the memblock.
>>> The phys_ram_base was set as the end address of mmoderes0 (like 0x80040000 for 256 KiB mmoderes0@...00000),
>>> which resulted the VA based on 2M-aligned PA was not 2M-aligned using va_pa_offset
>>> (PAGE_OFFSET - phys_ram_base) to translate.
>>>
>>> The best_map_size() from commit 3335068f8721 didn't check the virtual alignment
>>> before choosing a map size. and then a "VA hole" was created where page faults always occurred.
>>>
>>> This issue was fixed by commit 49a0a3731596 ("riscv: Check the virtual alignment before choosing a map size"),
>>> But this fixing commit has a side-effect ("the possible third one" as Alex said in this thread).
>>> There are numerous PTE allocations slowing down the boot time and consuming some system memory when UEFI booting
>>> (Note that it's not involved when booting directly with OpenSbi, where phys_ram_base is the 2M-aligned base of DRAM).
>>>
>>> In my test, compared with/out reverting both commit 49a0a3731596 and commit 3335068f8721,
>>> I must wait ~20s for the linear mapping creation and mem_init_print_info() reported ~8M extra reserved memory.
>>
>> Indeed, phys_ram_base is not aligned on a 2MB boundary when booting
>> with EDK2, IIRC that's because the first chunk of memory at
>> 0x8000_0000 is marked as UC and is then completely evicted.
>>
>>>
>>> To eliminate this side-effect, We should find a way to align VA and PA on a 2MB boundary.
>>> The simplest way is reverting the commit 3335068f8721 ("riscv: Use PUD/P4D/PGD pages for the linear mapping").
>>>
>>
>> I disagree, the simplest way is to align phys_ram_base on a 2MB
>> boundary, either by aligning to the upper boundary (and then wastes
>> memory, like we used to) or by aligning to the lower boundary (but I
>> want to make sure it works).
>>
>> I'll come up with something tomorrow.
> 
> @Song Shuai : can you test the following please?
> 
> commit a35b5e5e3f29e415f54fea064176315e79e21fb7 (HEAD ->
> dev/alex/align_va_pa_v1)
> Author: Alexandre Ghiti <alexghiti@...osinc.com>
> Date:   Mon Jun 5 14:26:55 2023 +0000
> 
>      riscv: Start of DRAM should at least be aligned on PMD size for
> the direct mapping
> 
>      So that we do not end up mapping the whole linear mapping using 4K
>      pages, which is slow at boot time, and also very likely at runtime.
> 
>      So make sure we align the start of DRAM on a PMD boundary.
> 
>      Signed-off-by: Alexandre Ghiti <alexghiti@...osinc.com>
> 
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 4fa420faa780..4a43ec275c6d 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -214,8 +214,13 @@ static void __init setup_bootmem(void)
>          memblock_reserve(vmlinux_start, vmlinux_end - vmlinux_start);
> 
>          phys_ram_end = memblock_end_of_DRAM();
> +
> +       /*
> +        * Make sure we align the start of the memory on a PMD boundary so that
> +        * at worst, we map the linear mapping with PMD mappings.
> +        */
>          if (!IS_ENABLED(CONFIG_XIP_KERNEL))
> -               phys_ram_base = memblock_start_of_DRAM();
> +               phys_ram_base = memblock_start_of_DRAM() & PMD_MASK;
> 
>          /*
>           * In 64-bit, any use of __va/__pa before this point is wrong as we
> 
I tested your patch and it really fixed what I concerned :

`There are numerous PTE allocations slowing down the boot time and 
consuming some system memory when UEFI booting.`

FYI, I posted the `ptdmp` and the available memory with/out the patch 
from my test:

```
 >> v6.4

---[ Linear mapping ]---
0xff60000000000000-0xff600000001c0000    0x0000000080040000      1792K 
PTE     D A G . . W R V
0xff600000001c0000-0xff60000000bc0000    0x0000000080200000        10M 
PTE     D A G . . . R V
0xff60000000bc0000-0xff60000000fc0000    0x0000000080c00000         4M 
PTE     D A G . . W R V
0xff60000000fc0000-0xff600000015c0000    0x0000000081000000         6M 
PTE     D A G . . . R V
0xff600000015c0000-0xff600000ffdfd000    0x0000000081600000   4169972K 
PTE     D A G . . W R V
0xff600000fffbf000-0xff600000fffc0000    0x000000017ffff000         4K 
PTE     D A G . . W R V

 >> v6.4 with the patch

---[ Linear mapping ]---
0xff60000000040000-0xff60000000200000    0x0000000080040000      1792K 
PTE     D A G . . W R V
0xff60000000200000-0xff60000000c00000    0x0000000080200000        10M 
PMD     D A G . . . R V
0xff60000000c00000-0xff60000001000000    0x0000000080c00000         4M 
PMD     D A G . . W R V
0xff60000001000000-0xff60000001600000    0x0000000081000000         6M 
PMD     D A G . . . R V
0xff60000001600000-0xff60000040000000    0x0000000081600000      1002M 
PMD     D A G . . W R V
0xff60000040000000-0xff600000c0000000    0x00000000c0000000         2G 
PUD     D A G . . W R V
0xff600000c0000000-0xff600000ffe00000    0x0000000140000000      1022M 
PMD     D A G . . W R V
0xff600000ffe00000-0xff600000ffe3d000    0x000000017fe00000       244K 
PTE     D A G . . W R V
0xff600000fffff000-0xff60000100000000    0x000000017ffff000         4K 
PTE     D A G . . W R V
```

```
 >> v6.4

Memory: 3945340K/4194048K available (8391K kernel code, 4959K rwdata, 
4096K rodata, 2195K init, 476K bss, 248708K reserved, 0K cma-reserved)

 >> v6.4 with the patch

Memory: 3953524K/4194048K available (8391K kernel code, 4959K rwdata, 
4096K rodata, 2195K init, 476K bss, 240524K reserved, 0K cma-reserved)
```
> An example of output when phys_ram_base is not aligned on a 2MB boundary:
> 
> ---[ Linear mapping ]---
> 0xffffaf8000008000-0xffffaf8000200000    0x0000000080008000      2016K
> PTE     D A G . . W R V
> 0xffffaf8000200000-0xffffaf8000e00000    0x0000000080200000        12M
> PMD     D A G . . . R V
> 0xffffaf8000e00000-0xffffaf8001200000    0x0000000080e00000         4M
> PMD     D A G . . W R V
> 0xffffaf8001200000-0xffffaf8001a00000    0x0000000081200000         8M
> PMD     D A G . . . R V
> 0xffffaf8001a00000-0xffffaf807e600000    0x0000000081a00000      1996M
> PMD     D A G . . W R V
> 0xffffaf807e600000-0xffffaf807e714000    0x00000000fe600000      1104K
> PTE     D A G . . W R V
> 0xffffaf807e715000-0xffffaf807e718000    0x00000000fe715000        12K
> PTE     D A G . . W R V
> 0xffffaf807e71b000-0xffffaf807e71c000    0x00000000fe71b000         4K
> PTE     D A G . . W R V
> 0xffffaf807e720000-0xffffaf807e800000    0x00000000fe720000       896K
> PTE     D A G . . W R V
> 0xffffaf807e800000-0xffffaf807fe00000    0x00000000fe800000        22M
> PMD     D A G . . W R V
> 0xffffaf807fe00000-0xffffaf807ff54000    0x00000000ffe00000      1360K
> PTE     D A G . . W R V
> 0xffffaf807ff55000-0xffffaf8080000000    0x00000000fff55000       684K
> PTE     D A G . . W R V
> 0xffffaf8080000000-0xffffaf83c0000000    0x0000000100000000        13G
> PUD     D A G . . W R V
> 0xffffaf83c0000000-0xffffaf83ffe00000    0x0000000440000000      1022M
> PMD     D A G . . W R V
> 0xffffaf83ffe00000-0xffffaf8400000000    0x000000047fe00000         2M
> PTE     D A G . . W R V
> 
> I found that it was easier to align phys_ram_base on the lower 2MB
> boundary. Aligning on the upper boundary is more complicated to me
> since there may be "something" between phys_ram_base and the upper 2MB
> boundary that needs to be accessed using the linear mapping (DT is
> accessed using fixmap so not a problem, initrd? ACPI tables? I don't
> know actually).
> 

And would there be possible influence from the diff between 
phys_ram_base and memblock_start_of_DRAM()?

> Weirdly simple though, I may be missing something, so any comment/test
> is welcome, it is currently running our internal CI.
> 
> Thanks,
> 
> Alex
> 
>>
>> Thanks,
>>
>> Alex
>>
>>>
>>>
>>> Using PUD/P4D/PGD pages for the linear mapping to improve the performance is marginal from a recent talk [4]
>>> from Mike Rapoport. OpenSbi had marked all the PMP-protected regions as "no-map" [5] to practice this talk.
>>>

I noticed best_map_size() still creates some huge-paged mapping (like 
above 2G PUD) with this patch.
How about to revert best_map_size() to disable huge-paged mapping to 
practice the Mike's talk.

>>> For all those reasons, let's revert these related commits:
>>>
>>> - commit 3335068f8721 ("riscv: Use PUD/P4D/PGD pages for the linear mapping")
>>> - commit 49a0a3731596 ("riscv: Check the virtual alignment before choosing a map size")
>>> - commit ed309ce52218 ("RISC-V: mark hibernation as nonportable")
>>>
>>> [1]: https://lore.kernel.org/linux-riscv/CAAYs2=gQvkhTeioMmqRDVGjdtNF_vhB+vm_1dHJxPNi75YDQ_Q@mail.gmail.com/
>>> [2]: https://lore.kernel.org/linux-kernel/20230530080425.18612-1-alexghiti@rivosinc.com/
>>> [3]: https://lore.kernel.org/linux-riscv/tencent_7C3B580B47C1B17C16488EC1@qq.com/
>>> [4]: https://lwn.net/Articles/931406/
>>> [5]: https://github.com/riscv-software-src/opensbi/commit/8153b2622b08802cc542f30a1fcba407a5667ab9
>>>
>>> Song Shuai (3):
>>>    Revert "RISC-V: mark hibernation as nonportable"
>>>    Revert "riscv: Check the virtual alignment before choosing a map size"
>>>    Revert "riscv: Use PUD/P4D/PGD pages for the linear mapping"
>>>
>>>   arch/riscv/Kconfig            |  5 +---
>>>   arch/riscv/include/asm/page.h | 16 -------------
>>>   arch/riscv/mm/init.c          | 43 +++++++----------------------------
>>>   arch/riscv/mm/physaddr.c      | 16 -------------
>>>   drivers/of/fdt.c              | 11 ++++-----
>>>   5 files changed, 14 insertions(+), 77 deletions(-)
>>>
>>> --
>>> 2.20.1
>>>
> 

-- 
Thanks
Song Shuai

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ