Message-ID: <20161024104102.GA2849@node.shutemov.name>
Date: Mon, 24 Oct 2016 13:41:02 +0300
From: "Kirill A. Shutemov" <kirill@...temov.name>
To: Jan Kara <jack@...e.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
Theodore Ts'o <tytso@....edu>,
Andreas Dilger <adilger.kernel@...ger.ca>,
Jan Kara <jack@...e.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Alexander Viro <viro@...iv.linux.org.uk>,
Hugh Dickins <hughd@...gle.com>,
Andrea Arcangeli <aarcange@...hat.com>,
Dave Hansen <dave.hansen@...el.com>,
Vlastimil Babka <vbabka@...e.cz>,
Matthew Wilcox <willy@...radead.org>,
Ross Zwisler <ross.zwisler@...ux.intel.com>,
linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
linux-block@...r.kernel.org
Subject: Re: [PATCHv3 13/41] truncate: make sure invalidate_mapping_pages()
can discard huge pages
On Wed, Oct 12, 2016 at 08:43:20AM +0200, Jan Kara wrote:
> On Wed 12-10-16 00:53:49, Kirill A. Shutemov wrote:
> > On Tue, Oct 11, 2016 at 05:58:15PM +0200, Jan Kara wrote:
> > > On Thu 15-09-16 14:54:55, Kirill A. Shutemov wrote:
> > > > invalidate_inode_page() has expectation about page_count() of the page
> > > > -- if it's not 2 (one to caller, one to radix-tree), it will not be
> > > > dropped. That condition almost never met for THPs -- tail pages are
> > > > pinned to the pagevec.
> > > >
> > > > Let's drop them, before calling invalidate_inode_page().
> > > >
> > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@...ux.intel.com>
> > > > ---
> > > > mm/truncate.c | 11 +++++++++++
> > > > 1 file changed, 11 insertions(+)
> > > >
> > > > diff --git a/mm/truncate.c b/mm/truncate.c
> > > > index a01cce450a26..ce904e4b1708 100644
> > > > --- a/mm/truncate.c
> > > > +++ b/mm/truncate.c
> > > > @@ -504,10 +504,21 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> > > > /* 'end' is in the middle of THP */
> > > > if (index == round_down(end, HPAGE_PMD_NR))
> > > > continue;
> > > > + /*
> > > > + * invalidate_inode_page() expects
> > > > + * page_count(page) == 2 to drop page from page
> > > > + * cache -- drop tail pages references.
> > > > + */
> > > > + get_page(page);
> > > > + pagevec_release(&pvec);
> > >
> > > I'm not quite sure why this is needed. When you have multiorder entry in
> > > the radix tree for your huge page, then you should not get more entries in
> > > the pagevec for your huge page. What do I miss?
> >
> > For compatibility reason find_get_entries() (which is called by
> > pagevec_lookup_entries()) collects all subpages of huge page in the range
> > (head/tails). See patch [07/41]
> >
> > So huge page, which is fully in the range it will be pinned up to
> > PAGEVEC_SIZE times.
>
> Yeah, I see. But then won't it be cleaner to provide iteration method that
> would add to pagevec each radix tree entry (regardless of its order) only
> once and then use it in places where we care? Instead of strange dances
> like you do here?

Maybe. It would require doubling the number of find_get_* helpers or an
additional flag in each. We have too many already.

And the multi-order entries interface for the radix tree has not settled
yet. I would rather defer such rework until it is fully shaped.

Let's come back to this later.

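To make the reference counting under discussion concrete, here is a toy
userspace model of it. It is only a sketch: the model_* names are
hypothetical, the slot count of 14 and the 512 subpages per huge page are
assumed values, and only the reference count is modelled, not the real
pagevec or radix-tree code.

```c
#include <assert.h>

/* Illustrative constants, assumed for this model only. */
#define MODEL_PAGEVEC_SIZE 14   /* slots filled by one lookup (assumption) */
#define MODEL_HPAGE_PMD_NR 512  /* subpages per huge page (assumption) */

/* Toy stand-in for a compound page: only the reference count is kept. */
struct model_page {
	int refcount;
};

/*
 * Model a lookup that returns every subpage in the range: each slot
 * holding a subpage takes one reference on the compound page.
 * Returns the number of slots used.
 */
static int model_lookup_subpages(struct model_page *page, int nr_in_range)
{
	int nr = nr_in_range < MODEL_PAGEVEC_SIZE ?
		 nr_in_range : MODEL_PAGEVEC_SIZE;

	page->refcount += nr;
	return nr;
}

/* Model of the workaround: take our own reference, drop the slot ones. */
static void model_pin_and_release(struct model_page *page, int nr_slots)
{
	assert(page->refcount >= nr_slots);
	page->refcount += 1;        /* stands in for get_page(page) */
	page->refcount -= nr_slots; /* stands in for pagevec_release(&pvec) */
}

/* The expectation from the patch description: caller + radix tree. */
static int model_can_invalidate(const struct model_page *page)
{
	return page->refcount == 2;
}
```

Starting from a count of 1 (the radix-tree reference), a lookup over a
whole huge page leaves the count well above 2, so the check fails until
the slot references are dropped.
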
> Ultimately we could convert all the places to use these new iteration
> methods but I don't see that as immediately necessary and maybe there are
> places where getting all the subpages in the pagevec actually makes life
> simpler for us (please point me if you know about such place).

I did it this way to avoid evaluating each use of find_get_*() one by one.
I guessed most of the callers of find_get_page() would be confused by
getting the head page instead of the relevant subpage. Maybe I was wrong
and it would have been easier to make the callers work with that. I don't
know...

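As a rough illustration of what a caller would have to do if it were
handed the head page, here is a userspace sketch. The model_* names are
hypothetical, and it assumes the subpages are laid out contiguously after
the head with 512 subpages per huge page; it does not use any real kernel
API.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_HPAGE_PMD_NR 512UL /* subpages per huge page (assumption) */

/* Toy page descriptor: only the file index of the page is kept. */
struct model_page {
	unsigned long index;
};

/*
 * Given the head of a huge page and a file index that falls inside it,
 * return the descriptor of the subpage backing that index. Assumes the
 * descriptors of one huge page are contiguous, head first.
 */
static struct model_page *model_subpage(struct model_page *head,
					unsigned long index)
{
	assert(index >= head->index);
	assert(index - head->index < MODEL_HPAGE_PMD_NR);
	return head + (index - head->index);
}
```

Every caller that received a head page would need a step like this before
touching per-subpage state, which is the per-caller audit mentioned above.
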
> On a somewhat unrelated note: I've noticed that you don't invalidate
> a huge page when only part of it should be invalidated. That actually
> breaks some assumptions filesystems make. In particular direct IO code
> assumes that if you do
>
> filemap_write_and_wait_range(inode, start, end);
> invalidate_inode_pages2_range(inode, start, end);
>
> all the page cache covering start-end *will* be invalidated. Your skipping
> of partial pages breaks this assumption and thus can bring consistency
> issues (e.g. write done using direct IO won't be seen by following buffered
> read).

Actually, invalidate_inode_pages2_range() does invalidate the whole page
if part of it is in the range. I caught this problem during testing.

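The rule can be sketched in userspace as a plain interval-overlap test.
This is only a model: the model_* names are hypothetical, 512 subpages per
huge page is an assumed value, and the range is taken as inclusive page
indexes.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_HPAGE_PMD_NR 512UL /* subpages per huge page (assumption) */

/* Index of the head page of the huge page that contains 'index'. */
static unsigned long model_head_index(unsigned long index)
{
	return index - (index % MODEL_HPAGE_PMD_NR);
}

/*
 * True if the huge page starting at 'head' shares at least one page
 * index with the inclusive range [start, end]. Under the rule above,
 * such a page is invalidated as a whole, even on partial overlap.
 */
static bool model_huge_page_in_range(unsigned long head,
				     unsigned long start,
				     unsigned long end)
{
	unsigned long last = head + MODEL_HPAGE_PMD_NR - 1;

	assert(start <= end);
	return head <= end && last >= start;
}
```

For example, a range covering only indexes 600 to 700 still selects the
huge page whose head is at index 512.
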
--
Kirill A. Shutemov