lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 12 Oct 2016 08:43:20 +0200
From:   Jan Kara <jack@...e.cz>
To:     "Kirill A. Shutemov" <kirill@...temov.name>
Cc:     Jan Kara <jack@...e.cz>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Theodore Ts'o <tytso@....edu>,
        Andreas Dilger <adilger.kernel@...ger.ca>,
        Jan Kara <jack@...e.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Hugh Dickins <hughd@...gle.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Dave Hansen <dave.hansen@...el.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Matthew Wilcox <willy@...radead.org>,
        Ross Zwisler <ross.zwisler@...ux.intel.com>,
        linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        linux-block@...r.kernel.org
Subject: Re: [PATCHv3 13/41] truncate: make sure invalidate_mapping_pages()
 can discard huge pages

On Wed 12-10-16 00:53:49, Kirill A. Shutemov wrote:
> On Tue, Oct 11, 2016 at 05:58:15PM +0200, Jan Kara wrote:
> > On Thu 15-09-16 14:54:55, Kirill A. Shutemov wrote:
> > > invalidate_inode_page() has expectation about page_count() of the page
> > > -- if it's not 2 (one to caller, one to radix-tree), it will not be
> > > dropped. That condition almost never met for THPs -- tail pages are
> > > pinned to the pagevec.
> > > 
> > > Let's drop them, before calling invalidate_inode_page().
> > > 
> > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@...ux.intel.com>
> > > ---
> > >  mm/truncate.c | 11 +++++++++++
> > >  1 file changed, 11 insertions(+)
> > > 
> > > diff --git a/mm/truncate.c b/mm/truncate.c
> > > index a01cce450a26..ce904e4b1708 100644
> > > --- a/mm/truncate.c
> > > +++ b/mm/truncate.c
> > > @@ -504,10 +504,21 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> > >  				/* 'end' is in the middle of THP */
> > >  				if (index ==  round_down(end, HPAGE_PMD_NR))
> > >  					continue;
> > > +				/*
> > > +				 * invalidate_inode_page() expects
> > > +				 * page_count(page) == 2 to drop page from page
> > > +				 * cache -- drop tail pages references.
> > > +				 */
> > > +				get_page(page);
> > > +				pagevec_release(&pvec);
> > 
> > I'm not quite sure why this is needed. When you have multiorder entry in
> > the radix tree for your huge page, then you should not get more entries in
> > the pagevec for your huge page. What do I miss?
> 
> For compatibility reason find_get_entries() (which is called by
> pagevec_lookup_entries()) collects all subpages of huge page in the range
> (head/tails). See patch [07/41]
> 
> So huge page, which is fully in the range it will be pinned up to
> PAGEVEC_SIZE times.

Yeah, I see. But then won't it be cleaner to provide iteration method that
would add to pagevec each radix tree entry (regardless of its order) only
once and then use it in places where we care? Instead of strange dances
like you do here?

Ultimately we could convert all the places to use these new iteration
methods but I don't see that as immediately necessary and maybe there are
places where getting all the subpages in the pagevec actually makes life
simpler for us (please point me if you know about such place).

On a somewhat unrelated note: I've noticed that you don't invalidate
a huge page when only part of it should be invalidated. That actually
breaks some assumptions filesystems make. In particular direct IO code
assumes that if you do

	filemap_write_and_wait_range(inode, start, end);
	invalidate_inode_pages2_range(inode, start, end);

all the page cache covering start-end *will* be invalidated. Your skipping
of partial pages breaks this assumption and thus can bring consistency
issues (e.g. write done using direct IO won't be seen by following buffered
read).

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ