Date:   Sun,  2 Apr 2023 22:22:33 -0700
From:   Ankur Arora <ankur.a.arora@...cle.com>
To:     linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org
Cc:     torvalds@...ux-foundation.org, akpm@...ux-foundation.org,
        luto@...nel.org, bp@...en8.de, dave.hansen@...ux.intel.com,
        hpa@...or.com, mingo@...hat.com, juri.lelli@...hat.com,
        willy@...radead.org, mgorman@...e.de, peterz@...radead.org,
        rostedt@...dmis.org, tglx@...utronix.de,
        vincent.guittot@...aro.org, jon.grimm@....com, bharata@....com,
        boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
        ankur.a.arora@...cle.com
Subject: [PATCH 9/9] x86/clear_huge_page: make clear_contig_region() preemptible

clear_contig_region() can be used to clear up to a huge-page (2MB/1GB)
chunk. Allow preemption in the irqentry_exit path to make sure we don't
hold on to the CPU for an arbitrarily long period.
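
The allow_resched()/disallow_resched() bracketing this relies on is
introduced earlier in this series. As a rough sketch of the intended
semantics (illustrative only; the flag name and the irqentry hook below
are assumptions, not the series' code):

    /*
     * Hypothetical sketch: mark a window in which the irqentry exit
     * path may reschedule the task, even under a preemption model
     * that would not otherwise preempt kernel code.
     */
    static inline void allow_resched(void)
    {
            set_tsk_thread_flag(current, TIF_ALLOW_RESCHED);  /* assumed flag */
    }

    static inline void disallow_resched(void)
    {
            clear_tsk_thread_flag(current, TIF_ALLOW_RESCHED);
    }

    /* and, on the irqentry_exit() path, something along the lines of: */
    if (need_resched() && test_tsk_thread_flag(current, TIF_ALLOW_RESCHED))
            preempt_schedule_irq();     /* preempt the long-running clear */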

Performance: vm-scalability/case-anon-w-seq-hugetlb mmaps an anonymous
hugetlb-2mb region, and then writes sequentially to the region, demand
faulting pages on the way.
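
A minimal reproducer in the same spirit (not the vm-scalability script
itself; the size and write stride here are arbitrary, and it needs a
preallocated hugetlb pool, e.g. via vm.nr_hugepages):

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 1UL << 30;     /* 1GB of 2MB hugetlb pages */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

            if (p == MAP_FAILED)
                    return 1;

            /* Sequential writes: each new 2MB page demand-faults and
             * gets cleared in the fault path modified here. */
            for (size_t off = 0; off < len; off += 64)
                    p[off] = 1;
            return 0;
    }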

This test, with a CONFIG_PREEMPT_VOLUNTARY config, shows the effects of
this change: stime drops (~18% on Icelakex, ~5% on Milan), while utime
goes up (~15% on Icelakex, ~13% on Milan).

  *Icelakex*                  mm/clear_huge_page   x86/clear_huge_page   change
  (mem=4GB/task, tasks=128)

  stime                           293.02 +- .49%        239.39 +- .83%   -18.30%
  utime                           440.11 +- .28%        508.74 +- .60%   +15.59%
  wall-clock                        5.96 +- .33%          6.27 +-2.23%   + 5.20%



  *Milan*                     mm/clear_huge_page   x86/clear_huge_page   change
  (mem=1GB/task, tasks=512)

  stime                          490.95 +- 3.55%       466.90 +- 4.79%   - 4.89%
  utime                          276.43 +- 2.85%       311.97 +- 5.15%   +12.85%
  wall-clock                       3.74 +- 6.41%         3.58 +- 7.82%   - 4.27%

The drop in stime is due to REP; STOS being more efficient for bigger
extents.  The increase in utime is due to cache effects of that change:
mm/clear_huge_page() clears a page at a time, narrowing towards the
faulting page, while x86/clear_huge_page() only optimizes for cache
locality in the local neighbourhood of the faulting address.
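
For reference, the narrowing order on the mm side looks roughly like
this (a simplified sketch of the idea in process_huge_page(), not the
actual mm/memory.c code; the x86 path instead clears larger extents
with REP; STOS around the faulting address):

    /*
     * Clear subpages from both ends towards the faulting subpage, so
     * the faulting page is cleared last and is still cache-hot when
     * we return to userspace.
     */
    static void clear_narrowing(unsigned int fault_idx, unsigned int nr,
                                void (*clear_one)(unsigned int idx))
    {
            unsigned int l = 0, r = nr - 1;

            while (l < fault_idx || r > fault_idx) {
                    if (l < fault_idx)
                            clear_one(l++);     /* left edge, moving right */
                    if (r > fault_idx)
                            clear_one(r--);     /* right edge, moving left */
            }
            clear_one(fault_idx);               /* faulting subpage last */
    }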

This effect on utime is visible via increased L1-dcache-load-misses and
LLC-load* counts, and as increased backend boundedness under perf stat
--all-user on Icelakex. The effect is slight but, given the heavy cache
pressure generated by the test, it shows up as a drop in user IPC:

    -  9,455,243,414,829      instructions                     #    2.75  insn per cycle              ( +- 14.14% )  (46.17%)
    -  2,367,920,864,112      L1-dcache-loads                  #    1.054 G/sec                       ( +- 14.14% )  (69.24%)
    -     42,075,182,813      L1-dcache-load-misses            #    2.96% of all L1-dcache accesses   ( +- 14.14% )  (69.24%)
    -         20,365,688      LLC-loads                        #    9.064 K/sec                       ( +- 13.98% )  (69.24%)
    -            890,382      LLC-load-misses                  #    7.18% of all LL-cache accesses    ( +- 14.91% )  (69.24%)

    +  9,467,796,660,698      instructions                     #    2.37  insn per cycle              ( +- 14.14% )  (46.16%)
    +  2,369,973,307,561      L1-dcache-loads                  #    1.027 G/sec                       ( +- 14.14% )  (69.24%)
    +     42,155,621,201      L1-dcache-load-misses            #    2.96% of all L1-dcache accesses   ( +- 14.14% )  (69.24%)
    +         22,116,300      LLC-loads                        #    9.588 K/sec                       ( +- 14.20% )  (69.24%)
    +          1,355,607      LLC-load-misses                  #   10.29% of all LL-cache accesses    ( +- 15.49% )  (69.25%)

Given that stime improves for all loads using this path, while the
utime increase is load dependent, add this change.

Signed-off-by: Ankur Arora <ankur.a.arora@...cle.com>
---
 arch/x86/mm/hugetlbpage.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 4294b77c4f18..c8564b0552e5 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -158,7 +158,17 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 static void clear_contig_region(struct page *page, unsigned long vaddr,
 				unsigned int npages)
 {
+	might_sleep();
+
+	/*
+	 * We might be clearing a large region.
+	 * Allow rescheduling.
+	 */
+	allow_resched();
 	clear_user_pages(page_address(page), vaddr, page, npages);
+	disallow_resched();
+
+	cond_resched();
 }
 
 void clear_huge_page(struct page *page,
-- 
2.31.1
