lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160531125253.GK26128@dhcp22.suse.cz>
Date:	Tue, 31 May 2016 14:52:53 +0200
From:	Michal Hocko <mhocko@...nel.org>
To:	Oleg Nesterov <oleg@...hat.com>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Mel Gorman <mgorman@...hsingularity.net>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: zone_reclaimable() leads to livelock in __alloc_pages_slowpath()

On Sun 29-05-16 23:25:40, Oleg Nesterov wrote:
> sorry for delay,
> 
> On 05/25, Michal Hocko wrote:
[...]
> > This is a bit surprising but my testing shows that the result shouldn't
> > make much difference. I can see some discrepancies between lru_vec size
> > and zone_reclaimable_pages but they are too small to actually matter.
> 
> Yes, the difference is small but it does matter.
> 
> I do not pretend I understand this all, but finally it seems I understand
> whats going on on my system when it hangs. At least, why the change in
> lruvec_lru_size() or calculate_normal_threshold() makes a difference.
> 
> This single change in get_scan_count() under for_each_evictable_lru() loop
> 
> 	-	size = lruvec_lru_size(lruvec, lru);
> 	+	size = zone_page_state_snapshot(lruvec_zone(lruvec), NR_LRU_BASE + lru);
> 
> fixes the problem too.
> 
> Without this change shrink*() continues to scan the LRU_ACTIVE_FILE list
> while it is empty. LRU_INACTIVE_FILE is not empty (just a few pages) but
> we do not even try to scan it, lruvec_lru_size() returns zero.

OK, you seem to be really seeing a different issue than me. My debugging
patch was showing when nothing was really isolated from the LRU lists
(both for shrink_{in}active_list.

> Then later we recheck zone_reclaimable() and it notices the INACTIVE_FILE
> counter because it uses the _snapshot variant, this leads to livelock.
> 
> I guess this doesn't really matter, but in my particular case these
> ACTIVE/INACTIVE counters were screwed by the recent putback_inactive_pages()
> logic. The pages we "leak" in INACTIVE list were recently moved from ACTIVE
> to INACTIVE list, and this updated only the per-cpu ->vm_stat_diff[] counters,
> so the "non snapshot" lruvec_lru_size() in get_scan_count() sees the "old"
> numbers.

Hmm. I am not really sure we can use the _snapshot version in lruvec_lru_size.
It is called also outise of slow paths (like add_to_page_cache_lru).
Maybe a path like below would be acceptable for stable trees.

But I am thinking whether we should simply revert 0db2cb8da89d ("mm,
vmscan: make zone_reclaimable_pages more precise") in 4.6 stable tree.
Does that help as well? The patch is certainly needed for the oom
detection rework but it got merged one release cycle earlier.
--- 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3388ccbab7d6..9f46a29c06b6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -764,7 +764,8 @@ static inline struct zone *lruvec_zone(struct lruvec *lruvec)
 #endif
 }
 
-extern unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru);
+extern unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru,
+		bool precise);
 
 #ifdef CONFIG_HAVE_MEMORY_PRESENT
 void memory_present(int nid, unsigned long start, unsigned long end);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4a2f4512fca..84420045090b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -213,11 +213,13 @@ bool zone_reclaimable(struct zone *zone)
 		zone_reclaimable_pages(zone) * 6;
 }
 
-unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
+unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, bool precise)
 {
 	if (!mem_cgroup_disabled())
 		return mem_cgroup_get_lru_size(lruvec, lru);
 
+	if (precise)
+		return zone_page_state_snapshot(lruvec_zone(lruvec), NR_LRU_BASE + lru);
 	return zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru);
 }
 
@@ -1902,8 +1904,8 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file)
 	if (!file && !total_swap_pages)
 		return false;
 
-	inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
-	active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
+	inactive = lruvec_lru_size(lruvec, file * LRU_FILE, true);
+	active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE, true);
 
 	gb = (inactive + active) >> (30 - PAGE_SHIFT);
 	if (gb)
@@ -2040,7 +2042,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * system is under heavy pressure.
 	 */
 	if (!inactive_list_is_low(lruvec, true) &&
-	    lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
+	    lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, true) >> sc->priority) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}
@@ -2066,10 +2068,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * anon in [0], file in [1]
 	 */
 
-	anon  = lruvec_lru_size(lruvec, LRU_ACTIVE_ANON) +
-		lruvec_lru_size(lruvec, LRU_INACTIVE_ANON);
-	file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE) +
-		lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
+	anon  = lruvec_lru_size(lruvec, LRU_ACTIVE_ANON, true) +
+		lruvec_lru_size(lruvec, LRU_INACTIVE_ANON, true);
+	file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, true) +
+		lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, true);
 
 	spin_lock_irq(&zone->lru_lock);
 	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
@@ -2107,7 +2109,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 			unsigned long size;
 			unsigned long scan;
 
-			size = lruvec_lru_size(lruvec, lru);
+			size = lruvec_lru_size(lruvec, lru, true);
 			scan = size >> sc->priority;
 
 			if (!scan && pass && force_scan)
diff --git a/mm/workingset.c b/mm/workingset.c
index 8a75f8d2916a..509cdf7a6fc9 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -269,7 +269,7 @@ bool workingset_refault(void *shadow)
 	}
 	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 	refault = atomic_long_read(&lruvec->inactive_age);
-	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
+	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, false);
 	rcu_read_unlock();
 
 	/*
-- 
Michal Hocko
SUSE Labs

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ