lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200623022636.GA1051134@ubuntu-n2-xlarge-x86>
Date:   Mon, 22 Jun 2020 19:26:36 -0700
From:   Nathan Chancellor <natechancellor@...il.com>
To:     Nitin Gupta <nigupta@...dia.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Vlastimil Babka <vbabka@...e.cz>,
        Khalid Aziz <khalid.aziz@...cle.com>,
        Oleksandr Natalenko <oleksandr@...hat.com>,
        Michal Hocko <mhocko@...e.com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Matthew Wilcox <willy@...radead.org>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Joonsoo Kim <iamjoonsoo.kim@....com>,
        David Rientjes <rientjes@...gle.com>,
        Nitin Gupta <ngupta@...ingupta.dev>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        linux-mm <linux-mm@...ck.org>,
        Linux API <linux-api@...r.kernel.org>,
        linux-mips@...r.kernel.org
Subject: Re: [PATCH v8] mm: Proactive compaction

On Tue, Jun 16, 2020 at 01:45:27PM -0700, Nitin Gupta wrote:
> For some applications, we need to allocate almost all memory as
> hugepages. However, on a running system, higher-order allocations can
> fail if the memory is fragmented. Linux kernel currently does on-demand
> compaction as we request more hugepages, but this style of compaction
> incurs very high latency. Experiments with one-time full memory
> compaction (followed by hugepage allocations) show that kernel is able
> to restore a highly fragmented memory state to a fairly compacted memory
> state within <1 sec for a 32G system. Such data suggests that a more
> proactive compaction can help us allocate a large fraction of memory as
> hugepages keeping allocation latencies low.
> 
> For a more proactive compaction, the approach taken here is to define a
> new sysctl called 'vm.compaction_proactiveness' which dictates bounds
> for external fragmentation which kcompactd tries to maintain.
> 
> The tunable takes a value in range [0, 100], with a default of 20.
> 
> Note that a previous version of this patch [1] was found to introduce
> too many tunables (per-order extfrag{low, high}), but this one reduces
> them to just one sysctl. Also, the new tunable is an opaque value
> instead of asking for specific bounds of "external fragmentation", which
> would have been difficult to estimate. The internal interpretation of
> this opaque value allows for future fine-tuning.
> 
> Currently, we use a simple translation from this tunable to [low, high]
> "fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
> The score for a node is defined as weighted mean of per-zone external
> fragmentation. A zone's present_pages determines its weight.
> 
> To periodically check per-node score, we reuse per-node kcompactd
> threads, which are woken up every 500 milliseconds to check the same. If
> a node's score exceeds its high threshold (as derived from user-provided
> proactiveness value), proactive compaction is started until its score
> reaches its low threshold value. By default, proactiveness is set to 20,
> which implies threshold values of low=80 and high=90.
> 
> This patch is largely based on ideas from Michal Hocko [2]. See also the
> LWN article [3].
> 
> Performance data
> ================
> 
> System: x64_64, 1T RAM, 80 CPU threads.
> Kernel: 5.6.0-rc3 + this patch
> 
> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
> 
> Before starting the driver, the system was fragmented from a userspace
> program that allocates all memory and then for each 2M aligned section,
> frees 3/4 of base pages using munmap. The workload is mainly anonymous
> userspace pages, which are easy to move around. I intentionally avoided
> unmovable pages in this test to see how much latency we incur when
> hugepage allocations hit direct compaction.
> 
> 1. Kernel hugepage allocation latencies
> 
> With the system in such a fragmented state, a kernel driver then
> allocates as many hugepages as possible and measures allocation
> latency:
> 
> (all latency values are in microseconds)
> 
> - With vanilla 5.6.0-rc3
> 
>   percentile latency
>   –––––––––– –––––––
> 	   5    7894
> 	  10    9496
> 	  25   12561
> 	  30   15295
> 	  40   18244
> 	  50   21229
> 	  60   27556
> 	  75   30147
> 	  80   31047
> 	  90   32859
> 	  95   33799
> 
> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
> 762G total free => 98% of free memory could be allocated as hugepages)
> 
> - With 5.6.0-rc3 + this patch, with proactiveness=20
> 
> sysctl -w vm.compaction_proactiveness=20
> 
>   percentile latency
>   –––––––––– –––––––
> 	   5       2
> 	  10       2
> 	  25       3
> 	  30       3
> 	  40       3
> 	  50       4
> 	  60       4
> 	  75       4
> 	  80       4
> 	  90       5
> 	  95     429
> 
> Total 2M hugepages allocated = 384105 (750G worth of hugepages out of
> 762G total free => 98% of free memory could be allocated as hugepages)
> 
> 2. JAVA heap allocation
> 
> In this test, we first fragment memory using the same method as for (1).
> 
> Then, we start a Java process with a heap size set to 700G and request
> the heap to be allocated with THP hugepages. We also set THP to madvise
> to allow hugepage backing of this heap.
> 
> /usr/bin/time
>  java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
> 
> The above command allocates 700G of Java heap using hugepages.
> 
> - With vanilla 5.6.0-rc3
> 
> 17.39user 1666.48system 27:37.89elapsed
> 
> - With 5.6.0-rc3 + this patch, with proactiveness=20
> 
> 8.35user 194.58system 3:19.62elapsed
> 
> Elapsed time remains around 3:15, as proactiveness is further increased.
> 
> Note that proactive compaction happens throughout the runtime of these
> workloads. The situation of one-time compaction, sufficient to supply
> hugepages for following allocation stream, can probably happen for more
> extreme proactiveness values, like 80 or 90.
> 
> In the above Java workload, proactiveness is set to 20. The test starts
> with a node's score of 80 or higher, depending on the delay between the
> fragmentation step and starting the benchmark, which gives more-or-less
> time for the initial round of compaction. As t	he benchmark consumes
> hugepages, node's score quickly rises above the high threshold (90) and
> proactive compaction starts again, which brings down the score to the
> low threshold level (80).  Repeat.
> 
> bpftrace also confirms proactive compaction running 20+ times during the
> runtime of this Java benchmark. kcompactd threads consume 100% of one of
> the CPUs while it tries to bring a node's score within thresholds.
> 
> Backoff behavior
> ================
> 
> Above workloads produce a memory state which is easy to compact.
> However, if memory is filled with unmovable pages, proactive compaction
> should essentially back off. To test this aspect:
> 
> - Created a kernel driver that allocates almost all memory as hugepages
>   followed by freeing first 3/4 of each hugepage.
> - Set proactiveness=40
> - Note that proactive_compact_node() is deferred maximum number of times
>   with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
>   (=> ~30 seconds between retries).
> 
> [1] https://patchwork.kernel.org/patch/11098289/
> [2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
> [3] https://lwn.net/Articles/817905/
> 
> Signed-off-by: Nitin Gupta <nigupta@...dia.com>
> Reviewed-by: Vlastimil Babka <vbabka@...e.cz>
> Reviewed-by: Khalid Aziz <khalid.aziz@...cle.com>
> Reviewed-by: Oleksandr Natalenko <oleksandr@...hat.com>
> Tested-by: Oleksandr Natalenko <oleksandr@...hat.com>
> To: Andrew Morton <akpm@...ux-foundation.org>
> CC: Vlastimil Babka <vbabka@...e.cz>
> CC: Khalid Aziz <khalid.aziz@...cle.com>
> CC: Michal Hocko <mhocko@...e.com>
> CC: Mel Gorman <mgorman@...hsingularity.net>
> CC: Matthew Wilcox <willy@...radead.org>
> CC: Mike Kravetz <mike.kravetz@...cle.com>
> CC: Joonsoo Kim <iamjoonsoo.kim@....com>
> CC: David Rientjes <rientjes@...gle.com>
> CC: Nitin Gupta <ngupta@...ingupta.dev>
> CC: Oleksandr Natalenko <oleksandr@...hat.com>
> CC: linux-kernel <linux-kernel@...r.kernel.org>
> CC: linux-mm <linux-mm@...ck.org>
> CC: Linux API <linux-api@...r.kernel.org>

This is now in -next and causes the following build failure:

$ make -skj"$(nproc)" ARCH=mips CROSS_COMPILE=mipsel-linux- O=out/mipsel distclean malta_kvm_guest_defconfig mm/compaction.o
In file included from include/linux/dev_printk.h:14,
                 from include/linux/device.h:15,
                 from include/linux/node.h:18,
                 from include/linux/cpu.h:17,
                 from mm/compaction.c:11:
In function 'fragmentation_score_zone',
    inlined from '__compact_finished' at mm/compaction.c:1982:11,
    inlined from 'compact_zone' at mm/compaction.c:2062:8:
include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed
  339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |                                      ^
include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert'
  320 |    prefix ## suffix();    \
      |    ^~~~~~
include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert'
  339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |  ^~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
   39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
      |                                     ^~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
   59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
      |                     ^~~~~~~~~~~~~~~~
arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG'
   70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; })
      |                              ^~~~~~~~~
mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER'
   66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER
      |                                ^~~~~~~~~~~~~~~~~~
mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER'
 1898 |    extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
      |                            ^~~~~~~~~~~~~~~~~~~~~~
In function 'fragmentation_score_zone',
    inlined from 'kcompactd' at mm/compaction.c:1918:12:
include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed
  339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |                                      ^
include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert'
  320 |    prefix ## suffix();    \
      |    ^~~~~~
include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert'
  339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |  ^~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
   39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
      |                                     ^~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
   59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
      |                     ^~~~~~~~~~~~~~~~
arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG'
   70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; })
      |                              ^~~~~~~~~
mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER'
   66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER
      |                                ^~~~~~~~~~~~~~~~~~
mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER'
 1898 |    extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
      |                            ^~~~~~~~~~~~~~~~~~~~~~
In function 'fragmentation_score_zone',
    inlined from 'kcompactd' at mm/compaction.c:1918:12:
include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed
  339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |                                      ^
include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert'
  320 |    prefix ## suffix();    \
      |    ^~~~~~
include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert'
  339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |  ^~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
   39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
      |                                     ^~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
   59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
      |                     ^~~~~~~~~~~~~~~~
arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG'
   70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; })
      |                              ^~~~~~~~~
mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER'
   66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER
      |                                ^~~~~~~~~~~~~~~~~~
mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER'
 1898 |    extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
      |                            ^~~~~~~~~~~~~~~~~~~~~~
In function 'fragmentation_score_zone',
    inlined from 'kcompactd' at mm/compaction.c:1918:12:
include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed
  339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |                                      ^
include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert'
  320 |    prefix ## suffix();    \
      |    ^~~~~~
include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert'
  339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |  ^~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
   39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
      |                                     ^~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
   59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
      |                     ^~~~~~~~~~~~~~~~
arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG'
   70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; })
      |                              ^~~~~~~~~
mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER'
   66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER
      |                                ^~~~~~~~~~~~~~~~~~
mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER'
 1898 |    extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
      |                            ^~~~~~~~~~~~~~~~~~~~~~
make[3]: *** [scripts/Makefile.build:281: mm/compaction.o] Error 1
make[3]: Target '__build' not remade because of errors.
make[2]: *** [Makefile:1765: mm] Error 2
make[2]: Target 'mm/compaction.o' not remade because of errors.
make[1]: *** [Makefile:336: __build_one_by_one] Error 2
make[1]: Target 'distclean' not remade because of errors.
make[1]: Target 'malta_kvm_guest_defconfig' not remade because of errors.
make[1]: Target 'mm/compaction.o' not remade because of errors.
make: *** [Makefile:185: __sub-make] Error 2
make: Target 'distclean' not remade because of errors.
make: Target 'malta_kvm_guest_defconfig' not remade because of errors.
make: Target 'mm/compaction.o' not remade because of errors.

I am not sure why MIPS is special with its handling of hugepage support
but I am far from a MIPS expert :)

Cheers,
Nathan

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ