linux-kernel - [PATCH v4] mm: swap: async free swap slot cache entries

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20240214-async-free-v4-1-6abe0d59f85f@kernel.org>
Date: Wed, 14 Feb 2024 17:02:13 -0800
From: Chris Li <chrisl@...nel.org>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, 
 Wei Xu <weixugc@...gle.com>, Yu Zhao <yuzhao@...gle.com>, 
 Greg Thelen <gthelen@...gle.com>, Chun-Tse Shao <ctshao@...gle.com>, 
 Yosry Ahmed <yosryahmed@...gle.com>, Michal Hocko <mhocko@...e.com>, 
 Mel Gorman <mgorman@...hsingularity.net>, Huang Ying <ying.huang@...el.com>, 
 Nhat Pham <nphamcs@...il.com>, Kairui Song <kasong@...cent.com>, 
 Barry Song <v-songbaohua@...o.com>, Tim Chen <tim.c.chen@...ux.intel.com>, 
 Chris Li <chrisl@...nel.org>
Subject: [PATCH v4] mm: swap: async free swap slot cache entries

We discovered that 1% swap page fault is 100us+ while 50% of
the swap fault is under 20us.

Further investigation shows that a large portion of the time
spent in the free_swap_slots() function for the long tail case.

The percpu cache of swap slots is freed in a batch of 64 entries
inside free_swap_slots(). These cache entries are accumulated
from previous page faults, which may not be related to the current
process.

Doing the batch free in the page fault handler causes longer
tail latencies and penalizes the current process.

When the swap cache slot is full, schedule async free cached
swap slots in a work queue, before the next swap fault comes in.
If the next swap fault comes in very fast, before the async
free gets a chance to run. It will directly free all the swap
cache in the swap fault the same way as previously.

Testing:

Chun-Tse did some benchmark in chromebook, showing that
zram_wait_metrics improve about 15% with 80% and 95% confidence.

I recently ran some experiments on about 1000 Google production
machines. It shows swapin latency drops in the long tail
100us - 500us bucket dramatically.

platform	(100-500us)	 	(0-100us)
A		1.12% -> 0.36%		98.47% -> 99.22%
B		0.65% -> 0.15%		98.96% -> 99.46%
C		0.61% -> 0.23%		98.96% -> 99.38%

Signed-off-by: Chris Li <chrisl@...nel.org>
---
Changes in v4:
- Remove the sysfs interface file, according the feedback.
- Move the full condition test inside the spinlock.
- Link to v3: https://lore.kernel.org/r/20240213-async-free-v3-1-b89c3cc48384@kernel.org

Changes in v3:
- Address feedback from Tim Chen, direct free path will free all swap slots.
- Add /sys/kernel/mm/swap/swap_slot_async_fee to enable async free. Default is off.
- Link to v2: https://lore.kernel.org/r/20240131-async-free-v2-1-525f03e07184@kernel.org

Changes in v2:
- Add description of the impact of time changing suggest by Ying.
- Remove create_workqueue() and use schedule_work()
- Link to v1: https://lore.kernel.org/r/20231221-async-free-v1-1-94b277992cb0@kernel.org
---
 include/linux/swap_slots.h |  1 +
 mm/swap_slots.c            | 20 ++++++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
index 15adfb8c813a..67bc8fa30d63 100644
--- a/include/linux/swap_slots.h
+++ b/include/linux/swap_slots.h
@@ -19,6 +19,7 @@ struct swap_slots_cache {
 	spinlock_t	free_lock;  /* protects slots_ret, n_ret */
 	swp_entry_t	*slots_ret;
 	int		n_ret;
+	struct work_struct async_free;
 };
 
 void disable_swap_slots_cache_lock(void);
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 0bec1f705f8e..23dc04bce9ca 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -44,6 +44,7 @@ static DEFINE_MUTEX(swap_slots_cache_mutex);
 static DEFINE_MUTEX(swap_slots_cache_enable_mutex);
 
 static void __drain_swap_slots_cache(unsigned int type);
+static void swapcache_async_free_entries(struct work_struct *data);
 
 #define use_swap_slot_cache (swap_slot_cache_active && swap_slot_cache_enabled)
 #define SLOTS_CACHE 0x1
@@ -149,6 +150,7 @@ static int alloc_swap_slot_cache(unsigned int cpu)
 		spin_lock_init(&cache->free_lock);
 		cache->lock_initialized = true;
 	}
+	INIT_WORK(&cache->async_free, swapcache_async_free_entries);
 	cache->nr = 0;
 	cache->cur = 0;
 	cache->n_ret = 0;
@@ -269,12 +271,27 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
 	return cache->nr;
 }
 
+static void swapcache_async_free_entries(struct work_struct *data)
+{
+	struct swap_slots_cache *cache;
+
+	cache = container_of(data, struct swap_slots_cache, async_free);
+	spin_lock_irq(&cache->free_lock);
+	/* Swap slots cache may be deactivated before acquiring lock */
+	if (cache->slots_ret && cache->n_ret) {
+		swapcache_free_entries(cache->slots_ret, cache->n_ret);
+		cache->n_ret = 0;
+	}
+	spin_unlock_irq(&cache->free_lock);
+}
+
 void free_swap_slot(swp_entry_t entry)
 {
 	struct swap_slots_cache *cache;
 
 	cache = raw_cpu_ptr(&swp_slots);
 	if (likely(use_swap_slot_cache && cache->slots_ret)) {
+		bool full;
 		spin_lock_irq(&cache->free_lock);
 		/* Swap slots cache may be deactivated before acquiring lock */
 		if (!use_swap_slot_cache || !cache->slots_ret) {
@@ -292,7 +309,10 @@ void free_swap_slot(swp_entry_t entry)
 			cache->n_ret = 0;
 		}
 		cache->slots_ret[cache->n_ret++] = entry;
+		full = cache->n_ret >= SWAP_SLOTS_CACHE_SIZE;
 		spin_unlock_irq(&cache->free_lock);
+		if (full)
+			schedule_work(&cache->async_free);
 	} else {
 direct_free:
 		swapcache_free_entries(&entry, 1);

---
base-commit: eacce8189e28717da6f44ee492b7404c636ae0de
change-id: 20231216-async-free-bef392015432

Best regards,
-- 
Chris Li <chrisl@...nel.org>