Message-Id: <20240809103129.365029-1-dev.jain@arm.com>
Date: Fri,  9 Aug 2024 16:01:27 +0530
From: Dev Jain <dev.jain@....com>
To: akpm@...ux-foundation.org,
	shuah@...nel.org,
	david@...hat.com,
	willy@...radead.org
Cc: ryan.roberts@....com,
	anshuman.khandual@....com,
	catalin.marinas@....com,
	cl@...two.org,
	vbabka@...e.cz,
	mhocko@...e.com,
	apopple@...dia.com,
	osalvador@...e.de,
	baolin.wang@...ux.alibaba.com,
	dave.hansen@...ux.intel.com,
	will@...nel.org,
	baohua@...nel.org,
	ioworker0@...il.com,
	gshan@...hat.com,
	mark.rutland@....com,
	kirill.shutemov@...ux.intel.com,
	hughd@...gle.com,
	aneesh.kumar@...nel.org,
	yang@...amperecomputing.com,
	peterx@...hat.com,
	broonie@...nel.org,
	mgorman@...hsingularity.net,
	linux-arm-kernel@...ts.infradead.org,
	linux-kernel@...r.kernel.org,
	linux-mm@...ck.org,
	linux-kselftest@...r.kernel.org,
	Dev Jain <dev.jain@....com>
Subject: [PATCH 0/2] Improve migration by backing off earlier

It was recently observed at [1] that during the folio unmapping stage
of migration, once the PTEs are cleared, a racing thread faulting on that
folio may increase the folio's refcount and then sleep on the folio lock
(which the migration path holds). Migration then fails when it checks
the actual refcount against the expected one.

Migration is a best-effort service; the unmapping and the moving phases
are each wrapped in a retry loop. The refcount of the folio is currently
checked only during the move stage; if the check fails, we retry. But if
a racing thread raises the refcount and ends up sleeping on the folio
lock (which is mostly the case), the refcount cannot be decremented while
migration holds the lock, which renders the retrying useless. In the
first patch, we also check the refcount during the unmap stage; if the
check fails, we restore the original state of the PTE, drop the folio
lock, let the system make progress, and retry unmapping. This improves
the probability of migration winning the race.
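
The back-off described above can be sketched as a small userspace toy
model. This is only an illustration of the idea, not the change made to
mm/migrate.c: struct toy_folio and every toy_* function are hypothetical
names invented for the sketch, and the "racing thread" is reduced to a
helper that drops its reference once the lock is free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a folio; NOT the kernel's struct folio. */
struct toy_folio {
	int refcount;	/* references currently held */
	int expected;	/* references migration expects to see */
	bool locked;	/* folio lock held by the migration path */
	bool mapped;	/* true while the PTEs still point at the folio */
};

/*
 * Unmap stage with an early refcount check. Returns true if the folio is
 * unmapped and still locked (ready for the move stage). On a mismatch,
 * restores the mapping and drops the lock so that others can progress.
 */
static bool toy_try_unmap(struct toy_folio *f)
{
	f->locked = true;
	f->mapped = false;		/* clear the PTEs */

	if (f->refcount != f->expected) {
		f->mapped = true;	/* restore the original PTE state */
		f->locked = false;	/* drop the folio lock */
		return false;
	}
	return true;
}

/*
 * Assumed behaviour of the racing faulter: once the lock is free, it
 * finishes its fault and releases the extra reference.
 */
static void toy_let_system_progress(struct toy_folio *f)
{
	if (!f->locked && f->refcount > f->expected)
		f->refcount--;
}

/* Returns the attempt number that succeeded, or -1 if all attempts failed. */
static int toy_migrate(struct toy_folio *f, int max_retries)
{
	assert(f != NULL);

	for (int attempt = 1; attempt <= max_retries; attempt++) {
		if (toy_try_unmap(f))
			return attempt;
		toy_let_system_progress(f);
	}
	return -1;
}
```

In this model, a folio with one unexpected reference fails the first
unmap attempt, is restored and unlocked, and succeeds on the second
attempt once the extra reference is gone. If the lock were kept across
retries instead, toy_let_system_progress() would never decrement the
refcount and every retry would fail the same way.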

Given that migration is a best-effort service, it is wrong for the
selftest to fail on a single migration failure; hence, the second patch
makes the test fail only after 100 consecutive failures (100 being a
subjective choice).
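
A consecutive-failure threshold of this kind can be sketched as follows.
This is a minimal illustration under stated assumptions, not the code in
tools/testing/selftests/mm/migration.c: the callback type, the toy_*
names and the TOY_MAX_CONSECUTIVE_FAILURES macro are made up for the
example; only the value 100 comes from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Threshold taken from the cover letter; the macro name is hypothetical. */
#define TOY_MAX_CONSECUTIVE_FAILURES 100

/* Hypothetical migration attempt: returns 0 on success, nonzero on failure. */
typedef int (*toy_migrate_fn)(void *ctx);

/*
 * Calls fn up to `rounds` times. A success resets the failure counter;
 * the run fails (-1) only when the counter reaches the threshold.
 * Returns 0 otherwise.
 */
static int toy_run_test(toy_migrate_fn fn, void *ctx, int rounds)
{
	int failures = 0;

	assert(fn != NULL);

	for (int i = 0; i < rounds; i++) {
		if (fn(ctx) != 0) {
			if (++failures >= TOY_MAX_CONSECUTIVE_FAILURES)
				return -1;
			continue;	/* best effort: just try again */
		}
		failures = 0;		/* any success resets the streak */
	}
	return 0;
}

/* Example callback: fails for the first *ctx calls, then always succeeds. */
static int toy_fail_n_times(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return 1;
	}
	return 0;
}
```

With toy_fail_n_times, a streak of 99 failures followed by a success
lets the run pass, while a streak of 100 failures makes it fail.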

[1] https://lore.kernel.org/all/20240801081657.1386743-1-dev.jain@arm.com/

Dev Jain (2):
  mm: Retry migration earlier upon refcount mismatch
  selftests/mm: Do not fail test for a single migration failure

 mm/migrate.c                           |  9 +++++++++
 tools/testing/selftests/mm/migration.c | 17 +++++++++++------
 2 files changed, 20 insertions(+), 6 deletions(-)

-- 
2.30.2

