Message-Id: <patchbomb.1212192268@localhost>
Date:	Sat, 31 May 2008 01:04:28 +0100
From:	Jeremy Fitzhardinge <jeremy@...p.org>
To:	Ingo Molnar <mingo@...e.hu>
Cc:	LKML <linux-kernel@...r.kernel.org>, x86@...nel.org,
	xen-devel <xen-devel@...ts.xensource.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Hugh Dickins <hugh@...itas.com>,
	Zachary Amsden <zach@...are.com>,
	kvm-devel <kvm-devel@...ts.sourceforge.net>,
	Virtualization Mailing List <virtualization@...ts.osdl.org>,
	Rusty Russell <rusty@...tcorp.com.au>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Linus Torvalds <torvalds@...ux-foundation.org>
Subject: [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction
	(take 2)

Hi all,

[ Change since last post: change name to ptep_modify_prot_, on the
  grounds that it isn't really a general pte-modification interface. ]

This little series adds a new transaction-like abstraction for doing
RMW updates to a pte, hooks it into paravirt_ops, and then makes use
of it in Xen.

The basic problem is that mprotect is very slow under Xen (up to 50x
slower than native), primarily because of the

	ptent = ptep_get_and_clear(mm, addr, pte);
	ptent = pte_modify(ptent, newprot);
	/* ... */
	set_pte_at(mm, addr, pte, ptent);

sequence in mm/mprotect.c:change_pte_range().

This is bad for Xen for two reasons:

  1: ptep_get_and_clear() ends up being an xchg on the pte.  Since the
     pte page is read-only (as it must be, because Xen needs to
     control all pte updates), this traps into Xen, which then
     emulates the instruction.  Trapping into the instruction emulator
     is inherently expensive.  And,

  2: because ptep_get_and_clear has atomic-fetch-and-update semantics,
     it's impossible to implement in a way which can be batched to
     amortize the cost of trapping into the hypervisor.

This series adds the ptep_modify_prot_start() and
ptep_modify_prot_commit() operations, which change this sequence to:

	ptent = ptep_modify_prot_start(mm, addr, pte);
	ptent = pte_modify(ptent, newprot);
	/* ... */
	ptep_modify_prot_commit(mm, addr, pte, ptent);

Which looks very familiar.  And, indeed, when compiled without
CONFIG_PARAVIRT (or on a non-x86 architecture), it will end up doing
precisely the same thing as before.

However, the effective semantics are a bit different.
ptep_modify_prot_start() means "I'm reading this pte with the
intention of updating it; please don't lose any hardware pte changes
in the meantime".  And ptep_modify_prot_commit() means "Here's a new
value for the pte, but make sure you don't lose any hardware changes".

The default implementation achieves these semantics by making
ptep_modify_prot_start() set the pte to non-present, which prevents
any async hardware changes to the pte.  The ptep_modify_prot_commit()
can then just write the new value into place without having to worry
about preserving any changes, because it knows there are none.
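
To make the default semantics concrete, here is a minimal userspace
sketch of them.  It is a model, not the kernel code: the model_* names,
the pte_t typedef, and the use of C11 atomics are assumptions made for
illustration; only the x86 bit positions are real.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef uint64_t pte_t;		/* stand-in for the kernel's pte_t */

#define PTE_PRESENT	(1ull << 0)
#define PTE_RW		(1ull << 1)
#define PTE_ACCESSED	(1ull << 5)
#define PTE_DIRTY	(1ull << 6)

/*
 * Model of the default "start": atomically fetch the pte and leave the
 * slot non-present, so the hardware can no longer set Accessed/Dirty.
 */
static pte_t model_modify_prot_start(_Atomic pte_t *ptep)
{
	return atomic_exchange(ptep, 0);
}

/*
 * Model of the default "commit": the slot has been non-present since
 * start, so there are no hardware changes to preserve and a plain
 * store of the new value is enough.
 */
static void model_modify_prot_commit(_Atomic pte_t *ptep, pte_t pte)
{
	atomic_store(ptep, pte);
}

/* Example transaction: drop write permission, keep every other bit. */
static pte_t model_write_protect(_Atomic pte_t *ptep)
{
	pte_t ptent = model_modify_prot_start(ptep);

	ptent &= ~PTE_RW;
	model_modify_prot_commit(ptep, ptent);
	return ptent;
}
```

The atomic exchange in the start step is what stands in for the xchg
that ptep_get_and_clear() performs.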

Xen implements ptep_modify_prot_start() as a simple read of the pte.
This leaves the pte unchanged in memory, and the hardware may make
asynchronous changes to it.  It implements ptep_modify_prot_commit()
using a batched hypercall which preserves the state of the
Access/Dirty bits when updating the pte.  This allows the whole
change_pte_range() loop to be run without any synchronous unbatched
traps into the hypervisor.  With this change in place, an mprotect
microbenchmark goes from being 50x worse than native to around 7x,
which is acceptable.
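
The Xen-style semantics can be sketched in userspace in the same way.
Again this is only a model under stated assumptions: the model_* names
are hypothetical, a compare-and-swap loop stands in for the work the
hypervisor would do inside the batched hypercall, and
model_hw_touch() stands in for the hardware setting bits
asynchronously.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef uint64_t pte_t;		/* stand-in for the kernel's pte_t */

#define PTE_PRESENT	(1ull << 0)
#define PTE_RW		(1ull << 1)
#define PTE_ACCESSED	(1ull << 5)
#define PTE_DIRTY	(1ull << 6)
#define PTE_AD_MASK	(PTE_ACCESSED | PTE_DIRTY)

/* "start" is a plain read: the pte stays live in memory. */
static pte_t model_xen_start(_Atomic pte_t *ptep)
{
	return atomic_load(ptep);
}

/* Stand-in for the hardware setting Accessed/Dirty behind our back. */
static void model_hw_touch(_Atomic pte_t *ptep, pte_t bits)
{
	atomic_fetch_or(ptep, bits & PTE_AD_MASK);
}

/*
 * "commit" installs the new value but carries over whatever
 * Accessed/Dirty bits are in memory at the moment of the update.
 */
static void model_xen_commit(_Atomic pte_t *ptep, pte_t newval)
{
	pte_t cur = atomic_load(ptep);

	while (!atomic_compare_exchange_weak(ptep, &cur,
					     newval | (cur & PTE_AD_MASK)))
		;	/* cur was refreshed by the failed exchange; retry */
}
```

If the model hardware dirties the pte between start and commit, the
Dirty bit survives the commit even though the value passed in was
computed from the older read.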

I believe that other virtualization systems, whether they use direct
paging like Xen, or a shadow pagetable scheme (vmi, kvm, lguest), can
make use of this interface to improve the performance.

Unfortunately (or fortunately) there aren't very many other areas of
the kernel which can really take advantage of this.  There are only a
couple of other instances of ptep_get_and_clear() in mm/, and they're
being used in a similar way; but I don't think they're very
performance critical (though zap_pte_range might be interesting).

In general, mprotect is rarely a performance bottleneck.  But some
debugging libraries (such as Electric Fence) and garbage collectors
can be very heavy users of mprotect, and this change could materially
benefit them.
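
For anyone who wants to time this kind of workload themselves, here is
a minimal sketch of an mprotect-heavy loop.  It is not the
microbenchmark behind the numbers above; bench_mprotect() and its
parameters are hypothetical, and it simply flips a populated anonymous
mapping between read-only and read-write.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS under strict -std= */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Returns the average nanoseconds per mprotect() call, or -1.0 on error. */
static double bench_mprotect(size_t npages, int iters)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	struct timespec t0, t1;
	size_t len;
	char *buf;
	int i;

	if (pagesz <= 0 || npages == 0 || iters <= 0)
		return -1.0;
	len = npages * (size_t)pagesz;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1.0;

	/* Fault in every page so there are populated ptes to change. */
	memset(buf, 1, len);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++) {
		/* Each call walks and rewrites the ptes of the whole range. */
		if (mprotect(buf, len, PROT_READ) ||
		    mprotect(buf, len, PROT_READ | PROT_WRITE)) {
			munmap(buf, len);
			return -1.0;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	munmap(buf, len);
	return ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_nsec - t0.tv_nsec)) / (2.0 * iters);
}
```

Comparing the result for the same npages and iters on native hardware
and in a guest gives a rough per-call slowdown figure.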

Thanks,
	J

