lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <1343064870.26034.23.camel@twins>
Date:	Mon, 23 Jul 2012 19:34:30 +0200
From:	Peter Zijlstra <peterz@...radead.org>
To:	Linus Torvalds <torvalds@...ux-foundation.org>,
	paulmck <paulmck@...ux.vnet.ibm.com>,
	Hugh Dickins <hughd@...gle.com>
Cc:	Rik van Riel <riel@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Nick Piggin <npiggin@...nel.dk>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	Andrea Arcangeli <aarcange@...hat.com>,
	linux-arch <linux-arch@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: [RFC] page-table walkers vs memory order


While staring at mm/huge_memory.c I found a very under-commented
smp_wmb() in __split_huge_page_map(). It turns out that its copied from
__{pte,pmd,pud}_alloc() but forgot the useful comment (or a reference
thereto).

Now, afaict we're not good, as per that comment. Paul has since
convinced some of us that compiler writers are pure evil and out to get
us.

Therefore we should do what rcu_dereference() does and use
ACCESS_ONCE()/barrier() followed smp_read_barrier_depends() every time
we dereference a page-table pointer.


In particular it looks like things like
mm/memcontrol.c:mem_cgroup_count_precharge(), which use
walk_page_range() under down_read(&mm->mmap_sem) and can thus be
concurrent with __{pte,pmd,pud}_alloc() from faults (and possibly
itself) are quite broken on Alpha and subtly broken for those of us with
'creative' compilers.

Should I go do a more extensive audit of page-table walkers or are we
happy with the status quo?

---
 arch/x86/mm/gup.c |    6 +++---
 mm/pagewalk.c     |   24 ++++++++++++++++++++++++
 2 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index dd74e46..4958fb1 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -150,7 +150,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 
 	pmdp = pmd_offset(&pud, addr);
 	do {
-		pmd_t pmd = *pmdp;
+		pmd_t pmd = ACCESS_ONCE(*pmdp);
 
 		next = pmd_addr_end(addr, end);
 		/*
@@ -220,7 +220,7 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
 
 	pudp = pud_offset(&pgd, addr);
 	do {
-		pud_t pud = *pudp;
+		pud_t pud = ACCESS_ONCE(*pudp);
 
 		next = pud_addr_end(addr, end);
 		if (pud_none(pud))
@@ -280,7 +280,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	local_irq_save(flags);
 	pgdp = pgd_offset(mm, addr);
 	do {
-		pgd_t pgd = *pgdp;
+		pgd_t pgd = ACCESS_ONCE(*pgdp);
 
 		next = pgd_addr_end(addr, end);
 		if (pgd_none(pgd))
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 6c118d0..2ba2a74 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -10,6 +10,14 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	int err = 0;
 
 	pte = pte_offset_map(pmd, addr);
+	/*
+	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
+	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
+	 * actual dereference of p[gum]d, but that's hidden deep within the
+	 * bowels of {pte,pmd,pud}_offset.
+	 */
+	barrier();
+	smp_read_barrier_depends();
 	for (;;) {
 		err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
 		if (err)
@@ -32,6 +40,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	int err = 0;
 
 	pmd = pmd_offset(pud, addr);
+	/*
+	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
+	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
+	 * actual dereference of p[gum]d, but that's hidden deep within the
+	 * bowels of {pte,pmd,pud}_offset.
+	 */
+	barrier();
+	smp_read_barrier_depends();
 	do {
 again:
 		next = pmd_addr_end(addr, end);
@@ -77,6 +93,14 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 	int err = 0;
 
 	pud = pud_offset(pgd, addr);
+	/*
+	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
+	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
+	 * actual dereference of p[gum]d, but that's hidden deep within the
+	 * bowels of {pte,pmd,pud}_offset.
+	 */
+	barrier();
+	smp_read_barrier_depends();
 	do {
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud)) {

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ