Date:	Wed, 17 Apr 2013 09:53:12 -0700
From:	Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>
To:	"Srivatsa S. Bhat" <srivatsa.bhat@...ux.vnet.ibm.com>
CC:	akpm@...ux-foundation.org, mgorman@...e.de,
	matthew.garrett@...ula.com, dave@...1.net, rientjes@...gle.com,
	riel@...hat.com, arjan@...ux.intel.com,
	maxime.coquelin@...ricsson.com, loic.pallardy@...ricsson.com,
	kamezawa.hiroyu@...fujitsu.com, lenb@...nel.org, rjw@...k.pl,
	gargankita@...il.com, paulmck@...ux.vnet.ibm.com,
	amit.kachhap@...aro.org, svaidy@...ux.vnet.ibm.com,
	andi@...stfloor.org, wujianguo@...wei.com, kmpark@...radead.org,
	thomas.abraham@...aro.org, santosh.shilimkar@...com,
	linux-pm@...r.kernel.org, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v2 00/15][Sorted-buddy] mm: Memory Power Management

On 04/09/2013 02:45 PM, Srivatsa S. Bhat wrote:
> [I know, this cover letter is a little too long, but I wanted to clearly
> explain the overall goals and the high-level design of this patchset in
> detail. I hope this helps more than it annoys, and makes it easier for
> reviewers to relate to the background and the goals of this patchset.]
>
>
> Overview of Memory Power Management and its implications to the Linux MM
> ========================================================================
>
> Today, we are increasingly seeing computer systems sporting larger and larger
> amounts of RAM, in order to meet workload demands. However, memory consumes a
> significant amount of power, potentially more than a third of total system
> power on server systems. So naturally, memory becomes the next big target for
> power management - on embedded systems and smartphones, and all the way up to
> large server systems.
>
> Power-management capabilities in modern memory hardware:
> -------------------------------------------------------
>
> Modern memory hardware such as DDR3 supports a number of power-management
> capabilities - for instance, the memory controller can automatically put
> memory DIMMs/banks into content-preserving low-power states, if it detects
> that the *entire* memory DIMM/bank has not been referenced for a threshold
> amount of time, thus reducing the energy consumption of the memory hardware.
> We term these power-manageable chunks of memory "Memory Regions".
>
> Exporting memory region info of the platform to the OS:
> ------------------------------------------------------
>
> The OS needs to know about the granularity at which the hardware can perform
> automatic power-management of the memory banks (i.e., the address boundaries
> of the memory regions). On ARM platforms, the bootloader can be modified to
> pass on this info to the kernel via the device-tree. On x86 platforms, the
> new ACPI 5.0 spec has added support for exporting the power-management
> capabilities of the memory hardware to the OS in a standard way[5].
>
> Estimate of power-savings from power-aware Linux MM:
> ---------------------------------------------------
>
> Once the firmware/bootloader exports the required info to the OS, it is up to
> the kernel's MM subsystem to make the best use of these capabilities and manage
> memory power-efficiently. It has been demonstrated on a Samsung Exynos board
> (with 2 GB RAM) that up to 6 percent of total system power can be saved by
> making the Linux kernel MM subsystem power-aware[4]. (More savings can be
> expected on systems with larger amounts of memory, and these could perhaps be
> improved further with better MM designs.)
>
>
> Role of the Linux MM in enhancing memory power savings:
> ------------------------------------------------------
>
> Often, this simply translates to having the Linux MM understand the granularity
> at which RAM modules can be power-managed, and keeping the memory allocations
> and references consolidated to a minimum number of these power-manageable
> "memory regions". It is of particular interest to note that most of this memory
> hardware has the intelligence to automatically save power, by putting memory
> banks into (content-preserving) low-power states when not referenced for a
> threshold amount of time. All the kernel has to do is avoid wrecking
> the power-savings logic by scattering its allocations and references all over
> the system memory. (The kernel/MM doesn't have to perform the actual power-state
> transitions; that is mostly done by the hardware automatically, and this is OK
> because these are *content-preserving* low-power states.)
>
> So we can summarize the goals for the Linux MM as:
>
> o Consolidate memory allocations and/or references such that they are not
> spread across the entire memory address space. Basically, the areas of memory
> that are not being referenced can then reside in a low-power state.
>
> o Support light-weight targeted memory compaction/reclaim, to evacuate
> lightly-filled memory regions. This helps avoid memory references to
> those regions, thereby allowing them to reside in low-power states.
>
>
> Assumptions and goals of this patchset:
> --------------------------------------
>
> In this patchset, we don't handle the part of getting the region boundary info
> from the firmware/bootloader and populating it in the kernel data-structures.
> The aim of this patchset is to propose and brainstorm on a power-aware design
> of the Linux MM which can *use* the region boundary info to influence the MM
> at various places such as page allocation, reclamation/compaction etc., thereby
> contributing to memory power savings. (This patchset is very much an RFC at
> the moment and is not intended for mainline inclusion yet.)
>
> So, in this patchset, we assume a simple model in which each 512MB chunk of
> memory can be independently power-managed, and hard-code this into the patchset.
> As mentioned, the focus of this patchset is not so much on how we get this info
> from the firmware or how exactly we handle a variety of configurations, but
> rather on discussing the power-savings/performance impact of the MM algorithms
> that *act* upon this info in order to save memory power.
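>
> For illustration, with the 512 MB region size assumed here (and 4 KB pages),
> mapping a pfn to its memory region is a simple divide. The snippet below is
> only a standalone userspace sketch with made-up names (pfn_to_region() etc.),
> not the actual code in the patches:
>
> #include <stdio.h>
>
> #define PAGE_SHIFT      12                      /* 4 KB pages (x86)    */
> #define REGION_SHIFT    29                      /* 512 MB = 2^29 bytes */
> #define PFNS_PER_REGION (1UL << (REGION_SHIFT - PAGE_SHIFT))   /* 131072 */
>
> static unsigned long pfn_to_region(unsigned long pfn)
> {
>         return pfn / PFNS_PER_REGION;
> }
>
> int main(void)
> {
>         unsigned long pfn = (3UL << 30) >> PAGE_SHIFT;  /* a page at the 3 GB mark */
>
>         /* 512 MB regions hold 2^(29-12) = 131072 pages each, which matches
>          * the "Present pages" column in the results further below. */
>         printf("pfn %lu is in region %lu\n", pfn, pfn_to_region(pfn));
>         return 0;
> }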
>
> That said, it's not very far-fetched to try this out with actual region
> boundary info to get the actual power-savings numbers. For example, on ARM
> platforms, we can make the bootloader export this info to the OS via device-tree
> and then run this patchset. (This was the method used to get the power numbers
> in [4].) But even without doing that, we can very well evaluate the
> effectiveness of this patchset in contributing to power savings, by analyzing
> the free-page statistics per memory region; and we can observe the performance
> impact by running benchmarks - this is the approach currently used to evaluate
> this patchset.
>
>
> Brief overview of the design/approach used in this patchset:
> -----------------------------------------------------------
>
> This patchset implements the 'Sorted-buddy design' for Memory Power Management,
> in which the buddy (page) allocator is altered to keep the buddy freelists
> region-sorted, which helps influence the page allocation paths to keep the
> allocations consolidated to a minimum number of memory regions. This patchset
> also includes a light-weight targeted compaction/reclaim algorithm that works
> hand-in-hand with the page allocator, to evacuate lightly-filled memory regions
> when memory gets fragmented, in order to further enhance memory power savings.
>
> This Sorted-buddy design was developed based on some of the suggestions
> received[1] during the review of the earlier patchset on Memory Power
> Management written by Ankita Garg ('Hierarchy design')[2].
> One of the key aspects of this Sorted-buddy design is that it avoids the
> zone-fragmentation problem that was present in the earlier design[3].
>
>
>
> Design of sorted buddy allocator and light-weight targeted region compaction:
> =============================================================================
>
> Sorted buddy allocator:
> ----------------------
>
> In this design, the memory region boundaries are captured in a data structure
> parallel to zones, instead of fitting regions between nodes and zones in the
> hierarchy. Further, the buddy allocator is altered, such that we maintain the
> zones' freelists in region-sorted-order and thus do page allocation in the
> order of increasing memory regions. (The freelists need not be fully
> address-sorted, they just need to be region-sorted).
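>
> As a purely illustrative, heavily simplified sketch (standalone C with
> made-up names; the actual data structures and functions in the patches are
> different), "keeping a freelist region-sorted" boils down to something like:
>
> #include <stdio.h>
>
> /* Illustrative only: a free pageblock tagged with its memory region. */
> struct free_block {
>         unsigned long region;           /* region number of this block */
>         unsigned long pfn;              /* starting pfn of the block   */
>         struct free_block *next;
> };
>
> /*
>  * Page-free path: insert the block so that the freelist stays sorted by
>  * region number.  (A linear walk is used here purely for clarity; the
>  * patchset's optimized O(log n) scheme is described in patch 9.)
>  */
> static void freelist_add_region_sorted(struct free_block **head,
>                                        struct free_block *blk)
> {
>         struct free_block **pos = head;
>
>         while (*pos && (*pos)->region <= blk->region)
>                 pos = &(*pos)->next;
>         blk->next = *pos;
>         *pos = blk;
> }
>
> /*
>  * Page-allocation path: just take the head of the list, which is always a
>  * block from the lowest-numbered (already powered-on) region.
>  */
> static struct free_block *freelist_take(struct free_block **head)
> {
>         struct free_block *blk = *head;
>
>         if (blk)
>                 *head = blk->next;
>         return blk;
> }
>
> int main(void)
> {
>         struct free_block b5 = { .region = 5, .pfn = 655360 };
>         struct free_block b2 = { .region = 2, .pfn = 262144 };
>         struct free_block *head = NULL;
>
>         freelist_add_region_sorted(&head, &b5);
>         freelist_add_region_sorted(&head, &b2);     /* sorts ahead of b5 */
>         printf("allocating from region %lu\n", freelist_take(&head)->region);
>         return 0;
> }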
>
> The idea is to do page allocation in increasing order of memory regions
> (within a zone) and perform region-compaction in the reverse order, as
> illustrated below.
>
> ---------------------------- Increasing region number---------------------->
>
> Direction of allocation--->               <---Direction of region-compaction
>
>
> The sorting logic (to maintain freelist pageblocks in region-sorted-order)
> lies in the page-free path and hence the critical page-allocation paths remain
> fast. Also, the sorting logic is optimized to be O(log n).
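>
> (As an aside, one simple way to find the lowest-numbered region that still
> has free pages is a bitmap of non-empty regions plus a single bit-search,
> as in the illustrative standalone fragment below. This is only a sketch;
> the actual O(log n) sorting algorithm used by this patchset is the one
> introduced in patch 9.)
>
> #include <stdio.h>
>
> /* Illustrative only: track which regions have free pages on a freelist,
>  * so the lowest-numbered non-empty region is found with one bit-search
>  * instead of a list walk. */
> static unsigned long region_nonempty_mask;      /* up to BITS_PER_LONG regions */
>
> static void region_set_nonempty(int r) { region_nonempty_mask |=  (1UL << r); }
> static void region_set_empty(int r)    { region_nonempty_mask &= ~(1UL << r); }
>
> static int lowest_nonempty_region(void)
> {
>         if (!region_nonempty_mask)
>                 return -1;                              /* freelist is empty */
>         return __builtin_ctzl(region_nonempty_mask);    /* lowest set bit */
> }
>
> int main(void)
> {
>         region_set_nonempty(3);
>         region_set_nonempty(7);
>         printf("allocate from region %d\n", lowest_nonempty_region());  /* 3 */
>         region_set_empty(3);
>         printf("then from region %d\n", lowest_nonempty_region());      /* 7 */
>         return 0;
> }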
>
> Advantages of this design:
> --------------------------
> 1. No zone-fragmentation (IOW, we don't create more zones than necessary) and
>     hence we avoid its associated problems (like too many zones, extra kswapd
>     activity, the question of choosing watermarks, etc.).
>     [This is an advantage over the 'Hierarchy' design]
>
> 2. Performance overhead is expected to be low: Since we retain the simplicity
>     of the algorithm in the page allocation path, page allocation can
>     potentially remain as fast as it would be without memory regions. The
>     overhead is pushed to the page-freeing paths which are not that critical.
>
>
> Light-weight targeted region compaction:
> ----------------------------------------
>
> Over time, due to multiple alloc()s and free()s in random order, memory gets
> fragmented, which means the memory allocations will no longer be consolidated
> to a minimum number of memory regions. In such cases we need a light-weight
> mechanism to opportunistically compact memory to evacuate lightly-filled
> memory regions, thereby enhancing the power savings.
>
> Noting that CMA (the Contiguous Memory Allocator) does targeted compaction to
> achieve its goals, this patchset generalizes that targeted-compaction code
> and reuses it to evacuate memory regions. The region evacuation is triggered
> by the page allocator: when it notices the first page allocation in a new
> region, it sets up a worker function to perform compaction and evacuate that
> region in the future, if possible. There are handshakes between the alloc
> and free paths in the page allocator to help do this smartly, which are
> explained in detail in the patches.
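>
> A heavily simplified, standalone sketch of just that trigger is shown below.
> All names here are made up and the alloc/free handshake is omitted; the real
> implementation is in the patches themselves:
>
> #include <stdbool.h>
> #include <stdio.h>
>
> #define NR_REGIONS      16
>
> static bool region_in_use[NR_REGIONS];   /* region has seen an allocation    */
> static bool evac_queued[NR_REGIONS];     /* evacuation worker pending for it */
>
> /* Stand-in for queueing deferred work (a workqueue item in the real kernel). */
> static void queue_region_evacuation(int region)
> {
>         evac_queued[region] = true;
>         printf("queued evacuation of region %d\n", region);
> }
>
> /* Allocation path: called after a page from 'region' has been handed out. */
> static void note_alloc_in_region(int region)
> {
>         if (!region_in_use[region]) {
>                 region_in_use[region] = true;
>                 /* First allocation in a previously-idle region: arrange to
>                  * move its pages elsewhere later, if possible, so the region
>                  * can drop back into a low-power state. */
>                 queue_region_evacuation(region);
>         }
> }
>
> /* Deferred worker: evacuate the region (migrate its movable pages away). */
> static void region_evacuation_worker(int region)
> {
>         if (!evac_queued[region])
>                 return;
>         /* ... isolate and migrate movable pages out of 'region' ... */
>         evac_queued[region] = false;
>         region_in_use[region] = false;
> }
>
> int main(void)
> {
>         note_alloc_in_region(4);        /* first allocation: queues the worker */
>         note_alloc_in_region(4);        /* subsequent allocations do nothing   */
>         region_evacuation_worker(4);
>         return 0;
> }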
>
>
> This patchset is hosted in the git tree below. It applies cleanly on
> v3.9-rc5.
>
> git://github.com/srivatsabhat/linux.git mem-power-mgmt-v2
>
>
> Changes in this v2:
> ==================
>
> * Fixed a bug in the NUMA case.
> * Added a new optimized O(log n) sorting algorithm to speed up region-sorting
>    of the buddy freelists (patch 9). The efficiency of this new algorithm and
>    its design allows us to support large amounts of RAM quite easily.
> * Added light-weight targeted compaction/reclaim support for memory power
>    management (patches 10-14).
> * Revamped the cover-letter to better explain the idea behind memory power
>    management and this patchset.
>
>
> Experimental Results:
> ====================
>
> Test setup:
> ----------
>
> x86 dual-socket quad core HT-enabled machine booted with mem=8G
> Memory region size = 512 MB
>
> Functional testing:
> ------------------
>
> Ran pagetest, a simple C program that allocates and touches a required number
> of pages.
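>
> (The exact program is not included in this posting; it is something along
> the lines of the standalone sketch below, which simply allocates and touches
> N pages and then holds on to them while statistics are read:)
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/mman.h>
>
> int main(int argc, char **argv)
> {
>         long pages = (argc > 1) ? atol(argv[1]) : 1;
>         long pagesize = sysconf(_SC_PAGESIZE);
>         size_t len = (size_t)pages * (size_t)pagesize;
>
>         char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
>                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>         if (mem == MAP_FAILED) {
>                 perror("mmap");
>                 return 1;
>         }
>
>         /* Write one byte per page so every page is actually faulted in. */
>         for (long i = 0; i < pages; i++)
>                 mem[i * pagesize] = 1;
>
>         printf("allocated and touched %ld pages (%zu bytes)\n", pages, len);
>         pause();        /* hold the memory while per-region stats are read */
>         return 0;
> }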
>
> Below are the statistics from the regions within ZONE_NORMAL, at various sizes
> of allocations from pagetest.
>
>
>       Present pages | Free pages at various allocation sizes    |
>                     |    start |   512 MB |  1024 MB |  2048 MB |
>   Region 0        1 |        0 |        0 |        0 |        0 |
>   Region 1   131072 |    41537 |    13858 |    13790 |    13334 |
>   Region 2   131072 |   131072 |    26839 |       82 |      122 |
>   Region 3   131072 |   131072 |   131072 |    26624 |        0 |
>   Region 4   131072 |   131072 |   131072 |   131072 |        0 |
>   Region 5   131072 |   131072 |   131072 |   131072 |    26624 |
>   Region 6   131072 |   131072 |   131072 |   131072 |   131072 |
>   Region 7   131072 |   131072 |   131072 |   131072 |   131072 |
>   Region 8   131071 |    72704 |    72704 |    72704 |    72704 |
>
> This shows that page allocation occurs in the order of increasing region
> numbers, as intended in this design.
>
> Performance impact:
> -------------------
>
> Kernbench results didn't show any noticeable performance degradation with
> this patchset as compared to vanilla 3.9-rc5.
>
>
> Todos and ideas for enhancing the design further:
> ================================================
>
> 1. Add support for making this work with sparsemem, memcg etc.
>
> 2. Mel Gorman pointed out that regular compaction algorithm would work
>     against the sorted-buddy allocation strategy, since it creates free space
>     at lower pfns. For now, I have not handled this because regular compaction
>     triggers only when the memory pressure is very high, and hence memory
>     power management is pointless in those situations. Besides, it is
>     immaterial whether memory allocations are consolidated towards lower or
>     higher pfns, because it saves power either way, and hence the regular
>     compaction algorithm doesn't actually work against memory power management.
>
> 3. Add more optimizations to the targeted region compaction algorithm in order
>     to enhance its benefits and reduce the overhead, such as:
>     a. Migrate only active pages during region evacuation, because, strictly
>        speaking, we only want to avoid _references_ to the region. So inactive
>        pages can be kept around, thus reducing the page-migration overhead.
>     b. Reduce the search-space for region evacuation, by having the
>        page-allocator note down the highest allocated pfn within that region.
>
> 4. Have stronger influence over how freepages from different migratetypes
>     are exchanged, so that unmovable and non-reclaimable allocations are
>     contained within the least number of memory regions.
>
> 5. Influence the refill of per-cpu pagesets and perhaps even heavily used
>     slab caches, such that they all get their memory from the least number of
>     memory regions. This is to avoid frequent fragmentation of memory regions.
>
> 6. Don't perform region evacuation in situations of high memory utilization.
>     Also, never use freepages from MIGRATE_RESERVE for the purpose of
>     region-evacuation.
>
> 7. Add more tracing/debug info to enable better evaluation of the
>     effectiveness and benefits of this patchset over vanilla kernel.
>
> 8. Add a higher level policy to control the aggressiveness of memory power
>     management.
>
>
> References:
> ----------
>
> [1]. Review comments suggesting modifying the buddy allocator to be aware of
>       memory regions:
>       http://article.gmane.org/gmane.linux.power-management.general/24862
>       http://article.gmane.org/gmane.linux.power-management.general/25061
>       http://article.gmane.org/gmane.linux.kernel.mm/64689
>
> [2]. Patch series that implemented the node-region-zone hierarchy design:
>       http://lwn.net/Articles/445045/
>       http://thread.gmane.org/gmane.linux.kernel.mm/63840
>
>       Summary of the discussion on that patchset:
>       http://article.gmane.org/gmane.linux.power-management.general/25061
>
>       Forward-port of that patchset to 3.7-rc3 (minimal x86 config)
>       http://thread.gmane.org/gmane.linux.kernel.mm/89202
>
> [3]. Disadvantages of having memory regions in the hierarchy between nodes and
>       zones:
>       http://article.gmane.org/gmane.linux.kernel.mm/63849
>
> [4]. Estimate of potential power savings on a Samsung Exynos board
>       http://article.gmane.org/gmane.linux.kernel.mm/65935
>
> [5]. ACPI 5.0 and MPST support
>       http://www.acpi.info/spec.htm
>       Section 5.2.21 Memory Power State Table (MPST)
>
> [6]. v1 of Sorted-buddy memory power management patchset:
>       http://thread.gmane.org/gmane.linux.power-management.general/28498
>
>
>   Srivatsa S. Bhat (15):
>        mm: Introduce memory regions data-structure to capture region boundaries within nodes
>        mm: Initialize node memory regions during boot
>        mm: Introduce and initialize zone memory regions
>        mm: Add helpers to retrieve node region and zone region for a given page
>        mm: Add data-structures to describe memory regions within the zones' freelists
>        mm: Demarcate and maintain pageblocks in region-order in the zones' freelists
>        mm: Add an optimized version of del_from_freelist to keep page allocation fast
>        bitops: Document the difference in indexing between fls() and __fls()
>        mm: A new optimized O(log n) sorting algo to speed up buddy-sorting
>        mm: Add support to accurately track per-memory-region allocation
>        mm: Restructure the compaction part of CMA for wider use
>        mm: Add infrastructure to evacuate memory regions using compaction
>        mm: Implement the worker function for memory region compaction
>        mm: Add alloc-free handshake to trigger memory region compaction
>        mm: Print memory region statistics to understand the buddy allocator behavior
>
>
>    arch/x86/include/asm/bitops.h      |    4
>   include/asm-generic/bitops/__fls.h |    5
>   include/linux/compaction.h         |    7
>   include/linux/gfp.h                |    2
>   include/linux/migrate.h            |    3
>   include/linux/mm.h                 |   62 ++++
>   include/linux/mmzone.h             |   78 ++++-
>   include/trace/events/migrate.h     |    3
>   mm/compaction.c                    |  149 +++++++++
>   mm/internal.h                      |   40 ++
>   mm/page_alloc.c                    |  617 ++++++++++++++++++++++++++++++++----
>   mm/vmstat.c                        |   36 ++
>   12 files changed, 935 insertions(+), 71 deletions(-)
>
>
> Regards,
> Srivatsa S. Bhat
> IBM Linux Technology Center
>
>
One thing you need to watch out for is boot-time allocation. You have to make
sure that frequently accessed per-node data stored at the end of memory doesn't
end up keeping all ranks of memory active.

Thanks,
Srinivas



