lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <cecd2164-ebfd-382b-af73-992cdc4304b7@amd.com>
Date:   Mon, 28 Jun 2021 12:46:59 -0400
From:   Felix Kuehling <felix.kuehling@....com>
To:     Ralph Campbell <rcampbell@...dia.com>,
        Alex Sierra <alex.sierra@....com>, akpm@...ux-foundation.org,
        linux-mm@...ck.org, linux-ext4@...r.kernel.org,
        linux-xfs@...r.kernel.org, "Yang, Philip" <Philip.Yang@....com>
Cc:     amd-gfx@...ts.freedesktop.org, dri-devel@...ts.freedesktop.org,
        hch@....de, jgg@...dia.com, jglisse@...hat.com
Subject: Re: [PATCH v3 2/8] mm: remove extra ZONE_DEVICE struct page refcount


Am 2021-06-17 um 3:16 p.m. schrieb Ralph Campbell:
>
> On 6/17/21 8:16 AM, Alex Sierra wrote:
>> From: Ralph Campbell <rcampbell@...dia.com>
>>
>> ZONE_DEVICE struct pages have an extra reference count that
>> complicates the
>> code for put_page() and several places in the kernel that need to
>> check the
>> reference count to see that a page is not being used (gup, compaction,
>> migration, etc.). Clean up the code so the reference count doesn't
>> need to
>> be treated specially for ZONE_DEVICE.
>>
>> v2:
>> AS: merged this patch in linux 5.11 version
>>
>> Signed-off-by: Ralph Campbell <rcampbell@...dia.com>
>> Signed-off-by: Alex Sierra <alex.sierra@....com>
>> ---
>>   arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
>>   drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
>>   fs/dax.c                               |  4 +-
>>   include/linux/dax.h                    |  2 +-
>>   include/linux/memremap.h               |  7 +--
>>   include/linux/mm.h                     | 44 -----------------
>>   lib/test_hmm.c                         |  2 +-
>>   mm/internal.h                          |  8 +++
>>   mm/memremap.c                          | 68 +++++++-------------------
>>   mm/migrate.c                           |  5 --
>>   mm/page_alloc.c                        |  3 ++
>>   mm/swap.c                              | 45 ++---------------
>>   12 files changed, 45 insertions(+), 147 deletions(-)
>>
> I think it is great that you are picking this up and trying to revive it.
>
> However, I have a number of concerns about how it affects existing
> ZONE_DEVICE
> MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_FS_DAX users and I don't see this
> addressing them. For example, dev_dax_probe() allocates
> MEMORY_DEVICE_GENERIC
> struct pages and then:
>   dev_dax_fault()
>     dev_dax_huge_fault()
>       __dev_dax_pte_fault()
>         vmf_insert_mixed()
> which just inserts the PFN into the CPU page tables without increasing
> the page
> refcount so it is zero (whereas it was one before). But using
> get_page() will
> trigger VM_BUG_ON_PAGE() if it is enabled. There isn't any current
> notion of
> free verses allocated for these struct pages. I suppose init_page_count()
> could be called on all the struct pages in dev_dax_probe() to fix that
> though.

Hi Ralph,

For DEVICE_GENERIC pages free_zone_device_page doesn't do anything. So
I'm not sure what the reference counting is good for in this case.

Alex is going to add free_zone_device_page support for DEVICE_GENERIC
pages (patch 8 of this series). However, even then, it only does
anything if there is an actual call to put_page. Where would that call
come from in the dev_dax driver case?


>
> I'm even less clear about how to fix MEMORY_DEVICE_FS_DAX. File
> systems have clear
> allocate and free states for backing storage but there are the
> complications with
> the page cache references, etc. to consider. The >1 to 1 reference
> count seems to
> be used to tell when a page is idle (no I/O, reclaim scanners) rather
> than free
> (not allocated to any file) but I'm not 100% sure about that since I
> don't really
> understand all the issues around why a file system needs to have a DAX
> mount option
> besides knowing that the storage block size has to be a multiple of
> the page size.

The only thing that happens in free_zone_device_page for FS_DAX pages is
wake_up_var(&page->_refcount). I guess, whoever is waiting for this
wake-up will need to be prepared to see a refcount 0 instead of 1 now. I
see these callers that would need to be updated:

./fs/ext4/inode.c:        error = ___wait_var_event(&page->_refcount,
./fs/ext4/inode.c-                atomic_read(&page->_refcount) == 1,
./fs/ext4/inode.c-                TASK_INTERRUPTIBLE, 0, 0,
./fs/ext4/inode.c-                ext4_wait_dax_page(ei));
--
./fs/fuse/dax.c:    return ___wait_var_event(&page->_refcount,
./fs/fuse/dax.c-            atomic_read(&page->_refcount) == 1,
TASK_INTERRUPTIBLE,
./fs/fuse/dax.c-            0, 0, fuse_wait_dax_page(inode));
--
./fs/xfs/xfs_file.c:    return ___wait_var_event(&page->_refcount,
./fs/xfs/xfs_file.c-            atomic_read(&page->_refcount) == 1,
TASK_INTERRUPTIBLE,
./fs/xfs/xfs_file.c-            0, 0, xfs_wait_dax_page(inode));

Regarding your page-cache comment, doesn't DAX mean that those file
pages are not in the page cache?

Regards,
  Felix


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ