Message-ID: <5b031938-9c82-4f09-b5dc-c45bc7fe6e07@amd.com>
Date: Wed, 31 Jul 2024 23:15:24 +0530
From: Shivank Garg <shivankg@....com>
To: "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
Thomas Gleixner <tglx@...utronix.de>
Cc: ardb@...nel.org, bp@...en8.de, brijesh.singh@....com, corbet@....net,
dave.hansen@...ux.intel.com, hpa@...or.com, jan.kiszka@...mens.com,
jgross@...e.com, kbingham@...nel.org, linux-doc@...r.kernel.org,
linux-efi@...r.kernel.org, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
luto@...nel.org, michael.roth@....com, mingo@...hat.com,
peterz@...radead.org, rick.p.edgecombe@...el.com, sandipan.das@....com,
thomas.lendacky@....com, x86@...nel.org
Subject: Re: [PATCH 0/3] x86: Make 5-level paging support unconditional for
x86-64

On 7/31/2024 5:06 PM, Kirill A. Shutemov wrote:
> On Wed, Jul 31, 2024 at 11:15:05AM +0200, Thomas Gleixner wrote:
>> On Wed, Jul 31 2024 at 14:27, Shivank Garg wrote:
>>> lmbench:lat_pagefault: Metric - page-fault time (us) - Lower is better
>>>
>>>              4-Level PT              5-Level PT              % Change
>>> THP-never    Mean: 0.4068            Mean: 0.4294            5.56
>>>              95% CI: 0.4057-0.4078   95% CI: 0.4287-0.4302
>>> THP-always   Mean: 0.4061            Mean: 0.4288            5.59
>>>              95% CI: 0.4051-0.4071   95% CI: 0.4281-0.4295
>>>
>>> Inference:
>>> 5-level page tables show an increase in page-fault latency, but the
>>> change does not significantly impact the other benchmarks.
>>
>> 5% regression on lmbench is a NONO.
>
> Yeah, that's a biggy.
>
> In our testing (on Intel HW) we didn't see any significant difference
> between 4- and 5-level paging, in both bare metal and in VMs. But we
> were focused on TLB fill latency. Maybe something is wrong in the fault
> path?
>
> It requires a closer look.
>
> Shivank, could you share how you run lat_pagefault? What file size? How
> many parallel instances do you run?
Hi Kirill,

I got lmbench from here:
https://github.com/foss-for-synopsys-dwc-arc-processors/lmbench/blob/master/src/lat_pagefault.c

and used this command:

numactl --membind=1 --cpunodebind=1 bin/x86_64-linux-gnu/lat_pagefault -N 100 1GB_dev_urandom_file
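
The input is a 1 GiB file of random data. For anyone reproducing this,
it can be created with something like (the exact dd invocation below is
illustrative, not copied from my shell history):

  # 1 GiB of random bytes for lat_pagefault to map and fault in
  dd if=/dev/urandom of=1GB_dev_urandom_file bs=1M count=1024
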
>
> It would also be nice to get perf traces. Maybe it is purely a SW issue.
>
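Here are the call graphs from perf, captured roughly along these lines
(the exact perf options are approximate):

  perf record -g -- numactl --membind=1 --cpunodebind=1 \
      bin/x86_64-linux-gnu/lat_pagefault -N 100 1GB_dev_urandom_file
  perf report -g
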
4-level-page-table:

- 52.31% benchmark
  - 49.52% asm_exc_page_fault
    - 49.35% exc_page_fault
      - 48.36% do_user_addr_fault
        - 46.15% handle_mm_fault
          - 44.59% __handle_mm_fault
            - 42.95% do_fault
              - 40.89% filemap_map_pages
                - 28.30% set_pte_range
                  - 23.70% folio_add_file_rmap_ptes
                    - 14.30% __lruvec_stat_mod_folio
                      - 10.12% __mod_lruvec_state
                        - 5.70% __mod_memcg_lruvec_state
                             0.60% cgroup_rstat_updated
                           1.06% __mod_node_page_state
                        2.84% __rcu_read_unlock
                        0.76% srso_alias_safe_ret
                    0.84% set_ptes.isra.0
                - 5.48% next_uptodate_folio
                  - 1.19% xas_find
                      0.96% xas_load
                  1.00% set_ptes.isra.0
          1.22% lock_vma_under_rcu

5-level-page-table:

- 52.75% benchmark
  - 50.04% asm_exc_page_fault
    - 49.90% exc_page_fault
      - 48.91% do_user_addr_fault
        - 46.74% handle_mm_fault
          - 45.27% __handle_mm_fault
            - 43.30% do_fault
              - 41.58% filemap_map_pages
                - 28.04% set_pte_range
                  - 22.77% folio_add_file_rmap_ptes
                    - 17.74% __lruvec_stat_mod_folio
                      - 10.89% __mod_lruvec_state
                        - 5.97% __mod_memcg_lruvec_state
                             1.94% cgroup_rstat_updated
                           1.09% __mod_node_page_state
                        0.56% __mod_node_page_state
                        2.28% __rcu_read_unlock
                    1.08% set_ptes.isra.0
                - 5.94% next_uptodate_folio
                  - 1.13% xas_find
                      0.99% xas_load
                  1.13% srso_alias_safe_ret
                  0.52% set_ptes.isra.0
          1.16% lock_vma_under_rcu
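
Comparing the two profiles, the walk down to filemap_map_pages is nearly
identical; the visible deltas are in the stat-accounting leaves, e.g.
__lruvec_stat_mod_folio (14.30% -> 17.74%) and cgroup_rstat_updated
(0.60% -> 1.94%).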
>> 5-level page tables add a cost in every hardware page table walk. That's
>> a matter of fact and there is absolutely no reason to inflict this cost
>> on everyone.
>>
>> The solution to this is to make the 5-level mechanics smarter by evaluating
>> whether the machine has enough memory to require 5-level tables and
>> select the depth at boot time.
>
> Let's understand the reason first.
Sure, please let me know how I can help debug this.
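
FWIW, a boot-time knob for the opposite direction already exists: booting
with no5lvl on the kernel command line forces 4-level paging, e.g.:

  # in /etc/default/grub (path and variable name depend on the distro)
  GRUB_CMDLINE_LINUX="... no5lvl"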
Thanks,
Shivank
>
> The risk with your proposal is that 5-level paging will not get any
> testing and will rot over time.
>
> I would like to keep it on, if possible.
>