Message-ID: <YfGkxtQd0KE8YNXt@casper.infradead.org>
Date: Wed, 26 Jan 2022 19:45:10 +0000
From: Matthew Wilcox <willy@...radead.org>
To: Pasha Tatashin <pasha.tatashin@...een.com>
Cc: LKML <linux-kernel@...r.kernel.org>, linux-mm <linux-mm@...ck.org>,
linux-m68k@...ts.linux-m68k.org,
Anshuman Khandual <anshuman.khandual@....com>,
Andrew Morton <akpm@...ux-foundation.org>,
william.kucharski@...cle.com,
Mike Kravetz <mike.kravetz@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>,
Geert Uytterhoeven <geert@...ux-m68k.org>,
schmitzmic@...il.com, Steven Rostedt <rostedt@...dmis.org>,
Ingo Molnar <mingo@...hat.com>,
Johannes Weiner <hannes@...xchg.org>,
Roman Gushchin <guro@...com>,
Muchun Song <songmuchun@...edance.com>,
Wei Xu <weixugc@...gle.com>, Greg Thelen <gthelen@...gle.com>,
David Rientjes <rientjes@...gle.com>,
Paul Turner <pjt@...gle.com>, Hugh Dickins <hughd@...gle.com>
Subject: Re: [PATCH v3 1/9] mm: add overflow and underflow checks for
page->_refcount
On Wed, Jan 26, 2022 at 02:22:26PM -0500, Pasha Tatashin wrote:
> On Wed, Jan 26, 2022 at 1:59 PM Matthew Wilcox <willy@...radead.org> wrote:
> >
> > On Wed, Jan 26, 2022 at 06:34:21PM +0000, Pasha Tatashin wrote:
> > > Problems with page->_refcount are hard to debug, because by the time
> > > they are detected, the damage usually occurred long ago. Yet an
> > > invalid page refcount can be catastrophic and lead to memory
> > > corruption.
> > >
> > > Narrow the window between a _refcount error and its detection by
> > > adding underflow and overflow checks to the functions that modify
> > > _refcount.
> >
> > If you're chasing a bug like this, presumably you turn on page
> > tracepoints. So could we reduce the cost of this by putting the
> > VM_BUG_ON_PAGE parts into __page_ref_mod() et al? Yes, we'd need to
> > change the arguments to those functions to pass in old & new, but that
> > should be a cheap change compared to embedding the VM_BUG_ON_PAGE.
>
> This is not only about chasing a bug. It is also about preventing
> the memory corruption and information leaks that ref_count bugs
> cause.
> Several months ago a memory corruption bug was discovered by accident:
> an engineer was studying a process core dump from a production system
> and noticed that some memory did not look like it belonged to the
> original process. We tried to reproduce that bug manually but failed.
> However, later analysis by our team showed that the problem was caused
> by a ref_count bug in Linux, and the bug itself was root-caused and
> fixed (mentioned in the cover letter). This work would have prevented
> similar ref_count bugs from turning into memory corruption.
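A minimal userspace sketch of the check being debated, assuming C11 atomics stand in for the kernel's atomic_t. Every modification fetches the old value, derives the new one, and hands both to a hook, which is roughly the shape of passing old & new into __page_ref_mod(). The names model_page, model_page_init(), ref_mod(), ref_mod_hook() and the checks_enabled switch are hypothetical and not the patch's code; the switch models the disagreement here: always checking (the patch's position) versus checking only when page tracepoints are on.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for struct page; only the count is modelled. */
struct model_page {
    atomic_int refcount;   /* plays the role of page->_refcount */
    int violations;        /* under/overflows seen by the hook */
};

/* Hypothetical switch: true = always check (the patch's position),
 * false = check only when page tracepoints are enabled. */
static bool checks_enabled = true;

/* Set the count; call before the page is shared. */
static void model_page_init(struct model_page *p, int count)
{
    atomic_init(&p->refcount, count);
    p->violations = 0;
}

/* Hypothetical hook receiving both old and new values, as suggested for
 * __page_ref_mod(). new_count is long long so it can hold a value outside
 * the int range. A kernel check would BUG here; the sketch only counts. */
static void ref_mod_hook(struct model_page *p, int old_count, long long new_count)
{
    if (!checks_enabled)
        return;
    if (new_count < 0) {
        fprintf(stderr, "refcount underflow: %d -> %lld\n", old_count, new_count);
        p->violations++;
    } else if (new_count > INT_MAX) {
        fprintf(stderr, "refcount overflow: %d -> %lld\n", old_count, new_count);
        p->violations++;
    }
}

/* Add delta (negative for a put). atomic_fetch_add() returns the old
 * value; C11 atomic arithmetic on signed types wraps silently, so the
 * stored count itself never hits undefined behaviour. */
static void ref_mod(struct model_page *p, int delta)
{
    int old_count = atomic_fetch_add(&p->refcount, delta);
    ref_mod_hook(p, old_count, (long long)old_count + delta);
}
```

With checks_enabled true, dropping a count of 1 twice records one underflow (0 -> -1); with it false, the same sequence passes silently. The sketch only shows where the old/new comparison can live, not what it costs in a real kernel.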
But the VM_BUG_ON_PAGE tells us next to nothing useful. To take
your first example [1] as the kind of thing you say this is going to
help fix:
1. Page p is allocated by thread a (refcount 1)
2. Thread b gets a mistaken pointer to p
3. Thread b calls put_page(), __put_page(), and the page goes back to the memory allocator.
4. Thread c calls alloc_page(), also gets page p (refcount 1 again).
5. Thread a calls put_page(), __put_page()
6. Thread c calls put_page() and gets a VM_BUG_ON_PAGE.
How do we find thread b's involvement? I don't think we can even see
thread a's involvement in all of this! All we know is a backtrace
pointing to thread c, who is a completely innocent bystander. I think
you have to enable page tracepoints to have any shot at finding thread
b's involvement.
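To make the six steps concrete, here is a single-threaded replay. It is a sketch in which the "threads" are only labels; replay_page, replay_alloc(), replay_put(), dump_log() and the event log are hypothetical stand-ins, with the log loosely playing the role of page_ref tracepoints. The underflow check can only name whoever is running when the count goes negative.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical model, not kernel code: "threads" are string labels and
 * everything runs on one thread in the order of the steps above. */

#define MAX_EVENTS 16

/* One logged refcount change; loosely models a page_ref tracepoint. */
struct trace_event {
    const char *actor;   /* "a", "b" or "c" */
    const char *op;      /* "alloc" or "put" */
    int old_count;
    int new_count;
};

struct replay_page {
    int refcount;
    int is_free;                        /* 1 while owned by the allocator */
    struct trace_event log[MAX_EVENTS];
    int nlog;
    const char *blamed;                 /* actor whose put tripped the check */
};

static void record(struct replay_page *p, const char *actor,
                   const char *op, int old_count, int new_count)
{
    if (p->nlog < MAX_EVENTS)
        p->log[p->nlog++] = (struct trace_event){ actor, op, old_count, new_count };
}

/* The allocator hands the page out with a fresh count of 1. */
static void replay_alloc(struct replay_page *p, const char *actor)
{
    int old_count = p->refcount;
    p->refcount = 1;
    p->is_free = 0;
    record(p, actor, "alloc", old_count, 1);
}

/* put_page(): drop one reference. Reaching 0 returns the page to the
 * allocator; going below 0 is where an underflow check fires, and it
 * can only blame the caller that happens to be running. */
static void replay_put(struct replay_page *p, const char *actor)
{
    int old_count = p->refcount;
    p->refcount = old_count - 1;
    record(p, actor, "put", old_count, p->refcount);
    if (p->refcount < 0) {
        if (p->blamed == NULL)
            p->blamed = actor;
    } else if (p->refcount == 0) {
        p->is_free = 1;
    }
}

/* Print the log, the way one would read tracepoint output. */
static void dump_log(const struct replay_page *p)
{
    for (int i = 0; i < p->nlog; i++)
        printf("%d: %s %s %d -> %d\n", i, p->log[i].actor, p->log[i].op,
               p->log[i].old_count, p->log[i].new_count);
}
```

Replaying steps 1 and 3 to 6 as replay_alloc(&p, "a"), replay_put(&p, "b"), replay_alloc(&p, "c"), replay_put(&p, "a"), replay_put(&p, "c") on a zero-initialized page leaves blamed set to "c". Thread b shows up only as the second log entry, a put taking the count from 1 to 0, which on its own looks like a legitimate final put.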
[1] https://lore.kernel.org/stable/20211122171825.1582436-1-gthelen@google.com/