lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-Id: <20200821015523.1698374-10-harshads@google.com>
Date:   Thu, 20 Aug 2020 18:55:23 -0700
From:   Harshad Shirwadkar <harshadshirwadkar@...il.com>
To:     linux-ext4@...r.kernel.org
Cc:     tytso@....edu, lyx1209@...il.com,
        Harshad Shirwadkar <harshadshirwadkar@...il.com>
Subject: [RFC PATCH v2 9/9] ext4: add freespace trees documentation in code

From: Harshad Shirwadkar <harshadshirwadkar@...il.com>

This patch adds a comment in mballoc.c describing the design and flow
of free-space trees.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@...il.com>
---
 fs/ext4/mballoc.c | 116 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 116 insertions(+)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index eb99e2fb9a68..b5d55a52daff 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -330,6 +330,122 @@
  *        object
  *
  */
+
+/*
+ * The freespace tree allocator
+ * -----------------------------
+ *
+ * High Level Design
+ * =================
+ * This allocator maintains the free space information about the file system in
+ * per-flexible block group level trees. For every flexible block group, we
+ * maintain individual freespace nodes in two trees, one sorted by flex-bg
+ * offset and other sorted by length. This gives us two benefits: i) In
+ * allocation path, our search time complexity is O(Log(Number of freespace
+ * regions in the flex-bg)). ii) In free path, in the same time complexity we
+ * can merge the adjacent nodes thereby reducing the size of the tree
+ * efficiently.
+ *
+ * Along with flexible block group level trees, we also introduce a top level
+ * meta-tree which consists of individual trees. This tree is sorted by length
+ * of largest extents found in flex-bgs. The key advantage that this tree gives
+ * us is this: During an allocation request, the allocator is now able to
+ * consult this tree and directly (in O(Log(Number of Flex BGs)) jump to a
+ * flexible block group which definitely has at least one (the largest) extent
+ * that can satisfy this request. If no flexible block group exists which can
+ * satisfy this request, the allocator can now immediately drop the CR level.
+ *
+ * In order to preseve the parallel allocation / free performance, the allocator
+ * only *tries* to consult this tree. It does so by calling read_trylock()
+ * function and if the meta-tree is busy, the allocator continues its linear
+ * search until it is able to grab a read-lock on this tree.
+ *
+ * Memory Footprint
+ * ================
+ *
+ * In a less fragmented file system, the memory footprint of the new allocator
+ * is significantly lower than buddy bitmaps. However, as the fragmentation
+ * level increases, the memory footprint of this allocator increases
+ * significantly. The memory usage of the freespace tree allocator can be
+ * controlled using sysfs tunable /sys/fs/ext4/<dev>/mb_frsp_max_mem. The
+ * default value of this is max(memory needed for buddy, maximum memory needed
+ * for one tree in the worst case). Accounting for max memory needed for one
+ * tree allows us to keep at least one tree in memory even in the worst
+ * case. This avoids thrashing. Once the memory threshold is reached, the
+ * allocator starts evicting trees from memory in LRU fashion. However, we don't
+ * remove tree's entry from the meta-tree. This allows the allocator to
+ * efficiently reconstruct only relevant trees from on-disk bitmaps. In our
+ * evaluations, we found that freespace tree allocator still manages to
+ * outperform buddy allocator in memory crunch situation.
+ *
+ * LRU
+ * ===
+ *
+ * We maintain two lists - an active list and an inactive list of trees. Trees
+ * in active list stay in it until evicted. In order to reduce contention on the
+ * active list lock, once a tree is in active list it is not moved inside the
+ * list. Also, a tree moves from inactive list to active list only if it is
+ * accessed twice in a EXT4_MB_FRSP_ACTIVE_THRESHOLD_JIFFIES window.
+ *
+ *
+ * Caching Tree Metadata
+ * =====================
+ * As noted in our experiments, we find caching tree metadata (the largest
+ * available extent in a tree) in the meta-tree significantly boosts allocation
+ * performance. However, if the allocator waits for the cache to fill up before
+ * starting to serve allocation requests, that may severely degrade allocation
+ * performance on large disks. Thus, it is important to tune the tree caching
+ * behavior according to the underlying block device. This patchset provides
+ * four cache aggression levels. Cache aggression level can be specified as a
+ * mount time parametere "frsp_cache_aggression". Here is the meaning of these
+ * different levels:
+ *
+ * Cache Aggression 0: Try to serve request only cached trees
+ * Cache Aggression 1 (Default): Try to serve request from only cached trees
+ *	at CR 0. At all other CRs, try to use an uncached tree for every
+ *	alternate request.
+ * Cache Aggression 2: Try to use an uncached tree for every alternate request
+ *	at all CR levels.
+ * Cache Aggression 3: Try to use uncached trees for every request.
+ *
+ * Moreover, if file system is mounted with "prefetch_block_bitmaps", tree
+ * caching starts at mount time.
+ *
+ * Locking Order
+ * =============
+ *
+ * - Tree lock
+ * - Meta tree lock (sbi->s_mb_frsp_lock)
+ * - LRU lock
+ *
+ * Critical sections of meta-tree lock and LRU locks are kept as small as
+ * possible and you shouldn't need to take meta-tree lock and lru-lock
+ * simultaneously.
+ *
+ * High Level Algorithm
+ * ====================
+ * - Consult meta-tree asking which flex-bg should the allocator look at
+ *   - One of the following things can happen
+ *     - Meta tree may find a suitable flex-bg for the request
+ *        - In this case we start searching in that tree
+ *     - Meta tree may realize that there's no suitable flex-bg
+ *        - In this case we increase CR and restart
+ *     - Based on the caching mode, meta tree may redirect us to an
+ *       uncached tree
+ *     - Meta tree is busy
+ *       - In this case, we dont wait for meta-tree to become available,
+ *         instead, we continue our search linearly.
+ * - Pick a tree (either based on what meta-tree told us or linearly from the
+ *   last one)
+ * - Manage LRU structure (ext4_mb_frsp_maintain_lru())
+ *   - Move tree to active list if needed
+ *   - Move trees from active list to inactive lists if needed
+ * - Perform search by length and pick matching extents.
+ * - Repeat until best found
+ * - Perform memory-crunch check and evict trees in LRU fashion if needed
+ *   (ext4_mb_unload_allocator()))
+ */
+
 static struct kmem_cache *ext4_pspace_cachep;
 static struct kmem_cache *ext4_ac_cachep;
 static struct kmem_cache *ext4_free_data_cachep;
-- 
2.28.0.297.g1956fa8f8d-goog

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ