Message-ID: <20160523214942.GA79646@black.fi.intel.com>
Date:	Tue, 24 May 2016 00:49:42 +0300
From:	"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
To:	Rik van Riel <riel@...hat.com>
Cc:	"Kirill A. Shutemov" <kirill@...temov.name>,
	Michal Hocko <mhocko@...nel.org>,
	Ebru Akagunduz <ebru.akagunduz@...il.com>, linux-mm@...ck.org,
	hughd@...gle.com, akpm@...ux-foundation.org,
	n-horiguchi@...jp.nec.com, aarcange@...hat.com,
	iamjoonsoo.kim@....com, gorcunov@...nvz.org,
	linux-kernel@...r.kernel.org, mgorman@...e.de, rientjes@...gle.com,
	vbabka@...e.cz, aneesh.kumar@...ux.vnet.ibm.com,
	hannes@...xchg.org, boaz@...xistor.com
Subject: Re: [PATCH 3/3] mm, thp: make swapin readahead under down_read of
 mmap_sem

On Mon, May 23, 2016 at 04:13:03PM -0400, Rik van Riel wrote:
> On Mon, 2016-05-23 at 23:02 +0300, Kirill A. Shutemov wrote:
> > On Mon, May 23, 2016 at 03:26:47PM -0400, Rik van Riel wrote:
> > > 
> > > On Mon, 2016-05-23 at 22:01 +0300, Kirill A. Shutemov wrote:
> > > > 
> > > > On Mon, May 23, 2016 at 02:49:09PM -0400, Rik van Riel wrote:
> > > > > 
> > > > > 
> > > > > On Mon, 2016-05-23 at 20:42 +0200, Michal Hocko wrote:
> > > > > > 
> > > > > > 
> > > > > > On Mon 23-05-16 20:14:11, Ebru Akagunduz wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > Currently khugepaged makes swapin readahead under
> > > > > > > > down_write. This patch makes swapin readahead run
> > > > > > > > under down_read instead of down_write.
> > > > > > You are still keeping down_write. Can we do without it
> > > > > > altogether?
> > > > > > > Blocking mmap_sem of a remote process for write is certainly
> > > > > > > not nice.
> > > > > Maybe Andrea can explain why khugepaged requires
> > > > > a down_write of mmap_sem?
> > > > > 
> > > > > If it were possible to have just down_read that
> > > > > would make the code a lot simpler.
> > > > > You need a down_write() to retract the page table. We need to make
> > > > > sure that nobody sees the page table before we can replace it with
> > > > > a huge pmd.
> > > Good point.
> > > 
> > > I guess the alternative is to have the page_table_lock
> > > taken by a helper function (everywhere) that can return
> > > failure if the page table was changed while the caller
> > > was waiting for the lock.
> > > It's not that the page table was changed, but that the pmd is now
> > > pointing to something else.
> > Basically, we would need to nest all pte-ptl's within pmd_lock().
> > That's not good for scalability.
> 
> I can see a few alternatives here:
> 
> 1) huge pmd collapsing takes both the pmd lock and the pte lock,
>    preventing pte updates from happening simultaneously

That's what we do now and that's not enough.

We would need to serialize against pmd_lock() during the normal page-fault
path (and other pte manipulation), which we don't do now if the pmd points
to a page table.

That's a huge hit on scalability.

> 
> 2) code that (re-)acquires the pte lock can read a sequence number
>    at the pmd level, check that it did not change after the
>    pte lock has been acquired, and abort if it has - I believe most
>    of the code that re-acquires the pte lock already knows how to
>    abort if somebody else touched the pte while it was looking
>    elsewhere

So, every pmd_lock() (and every other way we take the lock) should bump the
sequence number, and we need to be able to read a stable result outside
pmd_lock(), meaning it should be atomic_t or something similar.

Not exactly free.

And I'm not convinced the hassle is worth the gain.
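
[Editor's illustration, not part of the original message.] The sequence-number
idea discussed above can be sketched in userspace C roughly as follows. This
is only a toy model under stated assumptions: struct toy_pmd and every toy_*
function are hypothetical names invented here, pthread mutexes stand in for
the pmd and pte spinlocks, and a C11 atomic counter stands in for the
"atomic_t or something similar". It is not kernel code and not the approach
any patch in this thread implements.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for one pmd entry and the page table below it. */
struct toy_pmd {
	pthread_mutex_t pmd_lock;	/* models pmd_lock() */
	pthread_mutex_t pte_lock;	/* models the split pte lock */
	atomic_uint seq;		/* bumped on every pmd lock acquisition */
	int pte_val;			/* models the pte contents */
};

static void toy_pmd_init(struct toy_pmd *p)
{
	int rc = pthread_mutex_init(&p->pmd_lock, NULL);

	assert(rc == 0);
	rc = pthread_mutex_init(&p->pte_lock, NULL);
	assert(rc == 0);
	(void)rc;
	atomic_init(&p->seq, 0);
	p->pte_val = 0;
}

/* Every acquisition of the pmd lock bumps the sequence number. */
static void toy_pmd_lock(struct toy_pmd *p)
{
	pthread_mutex_lock(&p->pmd_lock);
	atomic_fetch_add(&p->seq, 1);
}

static void toy_pmd_unlock(struct toy_pmd *p)
{
	pthread_mutex_unlock(&p->pmd_lock);
}

/* The counter is atomic, so it can be sampled without holding pmd_lock. */
static unsigned int toy_read_seq(struct toy_pmd *p)
{
	return atomic_load(&p->seq);
}

/* Models collapse: bump under the pmd lock, then take the pte lock too. */
static void toy_collapse(struct toy_pmd *p)
{
	toy_pmd_lock(p);
	pthread_mutex_lock(&p->pte_lock);
	p->pte_val = 0;			/* page table retracted */
	pthread_mutex_unlock(&p->pte_lock);
	toy_pmd_unlock(p);
}

/*
 * Models a pte-level path that re-acquires the pte lock: it aborts if the
 * sequence number moved since the caller sampled it.
 */
static bool toy_pte_update(struct toy_pmd *p, unsigned int seen, int val)
{
	bool ok;

	pthread_mutex_lock(&p->pte_lock);
	ok = (toy_read_seq(p) == seen);
	if (ok)
		p->pte_val = val;
	pthread_mutex_unlock(&p->pte_lock);
	return ok;
}
```

Even in this toy, each pmd lock acquisition pays for an extra atomic
read-modify-write, and each pte-level path pays for an extra load and
compare, which is the "not exactly free" cost mentioned above.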

> That way the (uncommon) thp collapse code should still exclude
> pte level operations, at the cost of potentially teaching a few
> more pte level operations to abort (chances are most already do,
> considering a race with other pte-level manipulations requires that).

-- 
 Kirill A. Shutemov
