linux-kernel - Re: [PATCHv3 13/41] truncate: make sure invalidate_mapping

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20161012064320.GA13896@quack2.suse.cz>
Date:   Wed, 12 Oct 2016 08:43:20 +0200
From:   Jan Kara <jack@...e.cz>
To:     "Kirill A. Shutemov" <kirill@...temov.name>
Cc:     Jan Kara <jack@...e.cz>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Theodore Ts'o <tytso@....edu>,
        Andreas Dilger <adilger.kernel@...ger.ca>,
        Jan Kara <jack@...e.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Hugh Dickins <hughd@...gle.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Dave Hansen <dave.hansen@...el.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Matthew Wilcox <willy@...radead.org>,
        Ross Zwisler <ross.zwisler@...ux.intel.com>,
        linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        linux-block@...r.kernel.org
Subject: Re: [PATCHv3 13/41] truncate: make sure invalidate_mapping_pages()
 can discard huge pages

On Wed 12-10-16 00:53:49, Kirill A. Shutemov wrote:
> On Tue, Oct 11, 2016 at 05:58:15PM +0200, Jan Kara wrote:
> > On Thu 15-09-16 14:54:55, Kirill A. Shutemov wrote:
> > > invalidate_inode_page() has expectation about page_count() of the page
> > > -- if it's not 2 (one to caller, one to radix-tree), it will not be
> > > dropped. That condition almost never met for THPs -- tail pages are
> > > pinned to the pagevec.
> > > 
> > > Let's drop them, before calling invalidate_inode_page().
> > > 
> > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@...ux.intel.com>
> > > ---
> > >  mm/truncate.c | 11 +++++++++++
> > >  1 file changed, 11 insertions(+)
> > > 
> > > diff --git a/mm/truncate.c b/mm/truncate.c
> > > index a01cce450a26..ce904e4b1708 100644
> > > --- a/mm/truncate.c
> > > +++ b/mm/truncate.c
> > > @@ -504,10 +504,21 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> > >  				/* 'end' is in the middle of THP */
> > >  				if (index ==  round_down(end, HPAGE_PMD_NR))
> > >  					continue;
> > > +				/*
> > > +				 * invalidate_inode_page() expects
> > > +				 * page_count(page) == 2 to drop page from page
> > > +				 * cache -- drop tail pages references.
> > > +				 */
> > > +				get_page(page);
> > > +				pagevec_release(&pvec);
> > 
> > I'm not quite sure why this is needed. When you have multiorder entry in
> > the radix tree for your huge page, then you should not get more entries in
> > the pagevec for your huge page. What do I miss?
> 
> For compatibility reason find_get_entries() (which is called by
> pagevec_lookup_entries()) collects all subpages of huge page in the range
> (head/tails). See patch [07/41]
> 
> So huge page, which is fully in the range it will be pinned up to
> PAGEVEC_SIZE times.

Yeah, I see. But then won't it be cleaner to provide iteration method that
would add to pagevec each radix tree entry (regardless of its order) only
once and then use it in places where we care? Instead of strange dances
like you do here?

Ultimately we could convert all the places to use these new iteration
methods but I don't see that as immediately necessary and maybe there are
places where getting all the subpages in the pagevec actually makes life
simpler for us (please point me if you know about such place).

On a somewhat unrelated note: I've noticed that you don't invalidate
a huge page when only part of it should be invalidated. That actually
breaks some assumptions filesystems make. In particular direct IO code
assumes that if you do

	filemap_write_and_wait_range(inode, start, end);
	invalidate_inode_pages2_range(inode, start, end);

all the page cache covering start-end *will* be invalidated. Your skipping
of partial pages breaks this assumption and thus can bring consistency
issues (e.g. write done using direct IO won't be seen by following buffered
read).

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR