Date:   Mon, 4 Oct 2021 19:26:23 -0700 (PDT)
From:   Hugh Dickins <hughd@...gle.com>
To:     Matthew Wilcox <willy@...radead.org>
cc:     Yang Shi <shy828301@...il.com>, Hao Sun <sunhao.th@...il.com>,
        Hugh Dickins <hughd@...gle.com>, Song Liu <song@...nel.org>,
        Rongwei Wang <rongwei.wang@...ux.alibaba.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Linux MM <linux-mm@...ck.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        William Kucharski <william.kucharski@...cle.com>
Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page
 cache

On Mon, 4 Oct 2021, Matthew Wilcox wrote:
> On Mon, Oct 04, 2021 at 11:28:45AM -0700, Yang Shi wrote:
> > On Sat, Oct 2, 2021 at 10:09 AM Matthew Wilcox <willy@...radead.org> wrote:
> > > On Thu, Sep 30, 2021 at 10:39:14AM -0700, Yang Shi wrote:
> > > > On Thu, Sep 30, 2021 at 9:49 AM Hugh Dickins <hughd@...gle.com> wrote:
> > > > > I assume you're thinking of one of the fuzzer blkdev ones:
> > > > > https://lore.kernel.org/linux-mm/CACkBjsbtF_peC7N_4mRfHML_BeiPe+O9DahTfr84puSG_J9rcQ@mail.gmail.com/
> > > > > or
> > > > > https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/
> > > > >
> > > > > I haven't started on those ones yet: yes, I imagine one or both of those
> > > > > will need a further fix (S_ISREG() check somewhere if we're lucky; but
> > > > > could well be nastier); but for the bug in this thread, I expect
> > > >
> > > > Makes sense to me. We should be able to check S_ISREG() in khugepaged,
> > > > if it is not a regular file, just bail out. Sounds not that nasty to
> > > > me AFAIU.
> > >
> > > I don't see why we should have an S_ISREG() check.  I agree it's not the
> > > intended usecase, but it ought to work fine.  Unless there's something
> > > I'm missing?
> > 
> > Check out this bug report:
> > https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/
> > and the patch from me:
> > https://lore.kernel.org/linux-mm/20210917205731.262693-1-shy828301@gmail.com/
> > 
> > I don't think we handle buffers correctly for file THP, right? My
> > patch is ad hoc, so I thought Hugh's suggestion makes some sense to
> > me. Why do we have THP collapsed for unintended usecase in the first
> > place?
> 
> OK, I've done some more digging.  I think what's going on with this
> report is userspace opens the block device RO, causes the page cache to
> be loaded with data, then khugepaged comes in and creates THPs.

Yes.

> What confuses me is that these THPs have private data attached to them.
> I don't know how that happens.  If it's block device specific, then
> yes, something like your S_ISREG() idea should work fine.  Otherwise,
> we might need to track down another problem.

Agreed, the file THP is created without PagePrivate, so the puzzle was
why the read-only cached page would later become page_has_private().

The C repro showed that it uses a BLKRRPART ioctl (0x125f), as well as
a BTRFS_IOC_ADD_DEV ioctl which might not be relevant.  I didn't follow
BLKRRPART all the way down, but imagine it has to attach buffer-heads to
re-read the partition table, which would explain it.

Aside from that particular ioctl, it seems a good idea to insist on
S_ISREG just to shrink the attack surface: as Yang Shi says, executable
THP on block device was never an intended usecase, and not a usecase
anyone is likely to miss! And that fuzzer appears to delight in
tormenting /dev/nullb0, so let's just seal off that avenue.

You're right to have some doubt, as to whether there might be other
ways for buffer-heads to get attached, even on a read-only regular
file; but no way has sprung to my mind, and READ_ONLY_THP_FOR_FS has
survived well in its intended usage: so I think we should proceed on
the assumption that no further bugs remain - then fix them when found.

I wasn't able to reproduce the problem with the repro: it would take
many hours to do so.  But here's the untested S_ISREG patch I
came up with.  Sorry, I've mixed something else in: in moving the
alignment part to clarify the conditions, I was alarmed to see that
shmem with !shmem_huge_enabled was falling through to THP_FOR_FS to
give unexpected huge pages: fixed that, though later found there's
a separate shmem_huge_enabled() check which should exclude it.
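
For illustration only (not part of the patch): a userspace sketch of the
alignment condition that the patch hoists to the top for all file-backed
VMAs.  The constants assume x86-64 with 4 KiB base pages and 2 MiB
PMD-sized huge pages, and the EX_-prefixed names and helper functions
are hypothetical stand-ins for the kernel's PAGE_SHIFT, HPAGE_PMD_NR and
IS_ALIGNED().

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values: 4 KiB base pages, 512 base pages per PMD-sized huge page. */
#define EX_PAGE_SHIFT	12
#define EX_HPAGE_PMD_NR	512UL

/* Stand-in for the kernel's IS_ALIGNED(); 'a' must be a power of two. */
static bool is_aligned(unsigned long x, unsigned long a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x & (a - 1)) == 0;
}

/*
 * Hypothetical helper mirroring the check: the VMA's starting page number
 * and its file offset in pages must differ by a multiple of the huge page
 * size, so that huge-page-aligned file offsets land on huge-page-aligned
 * virtual addresses.
 */
static bool file_vma_thp_aligned(unsigned long vm_start, unsigned long vm_pgoff)
{
	return is_aligned((vm_start >> EX_PAGE_SHIFT) - vm_pgoff,
			  EX_HPAGE_PMD_NR);
}
```

With these assumed values, a mapping at 0x200000 of file offset 0 passes,
while the same address with a file offset of one page does not.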

--- 5.15-rc4/mm/khugepaged.c	2021-09-12 17:39:21.943438422 -0700
+++ linux/khugepaged.c	2021-10-03 20:41:13.194822795 -0700
@@ -445,22 +445,25 @@ static bool hugepage_vma_check(struct vm
 	if (!transhuge_vma_enabled(vma, vm_flags))
 		return false;
 
+	if (vma->vm_file && !IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) -
+					vma->vm_pgoff, HPAGE_PMD_NR))
+		return false;
+
 	/* Enabled via shmem mount options or sysfs settings. */
-	if (shmem_file(vma->vm_file) && shmem_huge_enabled(vma)) {
-		return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
-				HPAGE_PMD_NR);
-	}
+	if (shmem_file(vma->vm_file))
+		return shmem_huge_enabled(vma);
 
 	/* THP settings require madvise. */
 	if (!(vm_flags & VM_HUGEPAGE) && !khugepaged_always())
 		return false;
 
 	/* Read-only file mappings need to be aligned for THP to work. */
-	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && vma->vm_file &&
-	    !inode_is_open_for_write(vma->vm_file->f_inode) &&
-	    (vm_flags & VM_EXEC)) {
-		return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
-				HPAGE_PMD_NR);
+	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
+	    (vm_flags & VM_EXEC) && vma->vm_file) {
+		struct inode *inode = vma->vm_file->f_inode;
+
+		return !inode_is_open_for_write(inode) &&
+			S_ISREG(inode->i_mode);
 	}
 
 	if (!vma->anon_vma || vma->vm_ops)
