Message-ID: <jv6v6bgvh2uidqqeava72pjh2d5uehtyim74r3gatxn6v2uebh@t3lbrkhh6fzw>
Date: Wed, 4 Sep 2024 14:40:34 -0400
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Nam Cao <namcao@...utronix.de>
Cc: Dave Hansen <dave.hansen@...ux.intel.com>,
        Andy Lutomirski <luto@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
        Borislav Petkov <bp@...en8.de>, x86@...nel.org,
        "H. Peter Anvin" <hpa@...or.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Vlastimil Babka <vbabka@...e.cz>,
        Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        bigeasy@...utronix.de
Subject: Re: [PATCH] x86/mm/pat: Support splitting of virtual memory areas

* Nam Cao <namcao@...utronix.de> [240904 03:59]:
> On Tue, Sep 03, 2024 at 11:56:57AM -0400, Liam R. Howlett wrote:
> > * Nam Cao <namcao@...utronix.de> [240903 06:36]:
> ...
> > > On Tue, Aug 27, 2024 at 12:01:28PM -0400, Liam R. Howlett wrote:
> > > > * Nam Cao <namcao@...utronix.de> [240827 03:59]:
> > > > > On Mon, Aug 26, 2024 at 09:58:11AM -0400, Liam R. Howlett wrote:
> > > > > > * Nam Cao <namcao@...utronix.de> [240825 11:29]:
> ...
> > > > > > > 
> > > > > > > with the physical address starting from 0xfd000000, the range
> > > > > > > (0xfd000000-0xfd002000) would be tracked with the mmap() call.
> > > > > > > 
> > > > > > > After mprotect(), the initial range is split into
> > > > > > > (0xfd000000-0xfd001000) and (0xfd001000-0xfd002000).
> > > > > > > 
> > > > > > > Then, at munmap(), the first range does not match any entry in
> > > > > > > memtype_rbroot, and a message is seen in dmesg:
> > > > > > > 
> > > > > > >     x86/PAT: test:177 freeing invalid memtype [mem 0xfd000000-0xfd000fff]
> > > > > > > 
> > > > > > > The second range still matches by accident, because matching only the end
> > > > > > > address is acceptable (to handle shrinking VMA, added by 2039e6acaf94
> > > > > > > (x86/mm/pat: Change free_memtype() to support shrinking case)).
> > > > > > 
> > > > > > Does this need a fixes tag?
> > > > > 
> > > > > Yes, it should have
> > > > > 	Fixes: 2e5d9c857d4e ("x86: PAT infrastructure patch")
> > > > > thanks for the reminder.
> > > > 
> > > > That commit is from 2008, is there a bug report on this issue?
> > > 
> > > Not that I am aware of. I'm not entirely sure why, but I would guess due to
> > > the combination of:
> > > - This is not an issue for pages in RAM
> > > - This only happens if VMAs are split
> > > - The only user-visible effect is merely a pr_info(), and people may miss it.
> > > 
> > > I only encountered this issue while "trying to be smart" with mprotect() on
> > > a portion of mmap()-ed device memory, I guess probably not many people do
> > > that.
> > 
> > Or test it.  I would have thought some bots would have caught this.
> > Although the log message is just pr_info()?  That seems wrong - we have
> > an error in the vma tree or the PAT tree and it's just an info printk?
> 
> Yeah right, I think pr_info() is another issue, it should be pr_warn() or
> pr_err(). That is probably another patch.

Agreed.

> 
> ...
> > > > 
> > > > So the interval split should occur when the PAT changes and needs to be
> > > > tracked differently.  This does not happen when the vma is split - it
> > > > happens when a vma is removed or when the PAT is changed.
> > > > 
> > > > And, indeed, for the mremap() shrinking case, you already support
> > > > finding a range by just the end and have an abstraction layer.  The
> > > > problem here is that you don't check by the start - but you could.  You
> > > > could make the change to memtype_erase() to search for the exact, end,
> > > > or start and do what is necessary to shrink off the front of a region as
> > > > well.
> > > 
> > > I thought about this solution initially, but since the interval tree allows
> > > overlapping ranges, it can be tricky to determine the "best match" out
> > > of the overlapping ranges. But I agree that this approach (if possible)
> > > would be better than the current patch.
> > > 
> > > Let me think about this some more, and I will come back later.
> > 
> > Reading this some more, I believe you can detect the correct address by
> > matching the start address with the smallest end address (the smallest
> > interval has to be the entry created by the vma mapping).
> 
> I don't think that would cover all cases. For example, if the tree has 2
> intervals: [0x0000-0x2000] and [0x1000-0x3000]. Now, the mm subsystem tells
> us that the interval [0x1000-0x2000] needs to be removed (e.g. user does
> munmap()), your proposal would match this to the second interval. After the
> removal, the tree has [0-0x2000] and [0x2000-0x3000]
> 
> Then, mm subsystem says [0x1000-0x3000] should be removed, and that doesn't
> match anything. Turns out, the first removal was meant for the first
> interval, but we didn't have enough information at the time to determine
> that.
> 
> Bottom line is, it is not possible to correctly match [0x1000-0x2000] to
> [0x0000-0x2000] and [0x1000-0x3000]: both matches can be valid.

But those ranges won't exist.  What appears to be happening in this code
is that there are higher-level, non-overlapping ranges that have a
memory (cache) type (or have none defined), and these are tracked at
page granularity.  So we can't have a page that has two memory types.

The overlapping happens later, when the vmas are mapped.  And we are
ensuring that the mappings of the vmas match the higher, larger areas.
The vmas are inserted with memtype_check_insert(), which calls
memtype_check_conflict() to ensure that any overlapping areas have the
same type as the one being added, so either there is no match or the
interval(s) containing this page are set to a specific type.  I suspect
there can only really be one range.

So I don't think overlapping areas like above could exist.  The vma
cache type has to be the same throughout. It has to be the same type as
all overlapping areas.
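
As a rough illustration of the rule described above (an insert is only
accepted when every overlapping entry has the same type), here is a
small userspace sketch.  It is not the kernel code: struct range,
ranges_overlap() and can_insert() are hypothetical names, the type is a
plain integer, and a linear scan over an array stands in for the
interval tree.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a tracked entry; not the kernel's struct. */
struct range {
	uint64_t start;	/* inclusive */
	uint64_t end;	/* exclusive */
	int type;	/* illustrative cache type id */
};

/* Half-open ranges overlap when each one starts before the other ends. */
static bool ranges_overlap(const struct range *a, const struct range *b)
{
	return a->start < b->end && b->start < a->end;
}

/*
 * Return true if 'new' may be added: every existing entry that overlaps
 * it must carry the same type.  With no overlap at all, the insert is
 * trivially allowed.
 */
static bool can_insert(const struct range *tbl, size_t n,
		       const struct range *new)
{
	for (size_t i = 0; i < n; i++) {
		if (ranges_overlap(&tbl[i], new) && tbl[i].type != new->type)
			return false;
	}
	return true;
}
```

Under this rule, a request that straddles two differently typed areas is
rejected, which is why a page should never end up with two types.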

Also, your ranges are inclusive while the ranges passed in seem to be
exclusive on the end address, so your example would look more like:
[0x0000-0x2000) [0x2000-0x3000).

You can see this documented in memtype_reserve() where sanitize_phys()
is called.
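
To make the end-exclusive convention concrete, here is a minimal
userspace sketch.  The struct hrange type and its helpers are made up
for illustration and do not exist in the kernel.  It shows that the
dmesg line quoted earlier, [mem 0xfd000000-0xfd000fff], is the inclusive
way of printing the end-exclusive range [0xfd000000-0xfd001000), and
that two ranges sharing a boundary do not overlap.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical half-open range: start is included, end is not. */
struct hrange {
	uint64_t start;
	uint64_t end;
};

/* Number of bytes covered by the range. */
static uint64_t hrange_len(struct hrange r)
{
	return r.end - r.start;
}

/* Last byte address inside the range (what an inclusive print shows). */
static uint64_t hrange_last(struct hrange r)
{
	return r.end - 1;
}

/* Two half-open ranges overlap only if each starts before the other ends. */
static bool hrange_overlap(struct hrange a, struct hrange b)
{
	return a.start < b.end && b.start < a.end;
}
```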

So we could have a VMA of [0x1000-0x2000), but this vma would have to be
in the first range.  [0x0000-0x1000) would also be in the first range.

I think that searching for the smallest area containing the entry will
yield the desired entry in the interval tree.
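
A minimal sketch of that lookup, assuming end-exclusive ranges and using
a linear scan over an array in place of the interval tree.  The names
struct span and smallest_containing() are hypothetical; this is not a
proposed change to memtype_erase(), only the selection rule in
isolation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tracked entry with an exclusive end address. */
struct span {
	uint64_t start;
	uint64_t end;
};

/*
 * Among all entries that fully contain [start, end), return the one
 * with the smallest length, or NULL if no entry contains the request.
 * On a tie, the earliest entry in the table is kept.
 */
static const struct span *smallest_containing(const struct span *tbl,
					      size_t n,
					      uint64_t start, uint64_t end)
{
	const struct span *best = NULL;

	for (size_t i = 0; i < n; i++) {
		const struct span *s = &tbl[i];

		/* Skip entries that do not cover the whole request. */
		if (s->start > start || s->end < end)
			continue;
		if (!best || s->end - s->start < best->end - best->start)
			best = s;
	}
	return best;
}
```

With entries [0x0000-0x4000), [0x0000-0x2000) and [0x2000-0x3000), a
request for [0x1000-0x2000) selects [0x0000-0x2000), while a request for
[0x1000-0x3000) can only be satisfied by [0x0000-0x4000).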

Note that there is debugging support in the Documentation so you can go
look at what is in there with debugfs.

...

> One solution I can think of: stop allowing overlapping intervals. Instead,
> the overlapping portions would be split into new intervals with some
> reference counting. memtype_erase() would need to be modified to:
>   - assemble the potentially split intervals
>   - split the intervals if needed
> The point is, there wouldn't be any confusion with matching overlapping
> intervals.
> 
> I will give it a try when I have some time, unless someone sees a problem
> with it or has a better idea.

I don't think this will work at all.  The code depends on overlapping
ranges to ensure the vmas match what is allowed in certain areas.

Thanks,
Liam
