Message-ID: <3b094493-9c1e-6024-bfd5-7eca66399b7e@redhat.com>
Date:   Thu, 26 Aug 2021 23:42:19 +0200
From:   David Hildenbrand <david@...hat.com>
To:     SeongJae Park <sj38.park@...il.com>
Cc:     akpm@...ux-foundation.org, markubo@...zon.com,
        SeongJae Park <sjpark@...zon.de>, Jonathan.Cameron@...wei.com,
        acme@...nel.org, alexander.shishkin@...ux.intel.com,
        amit@...nel.org, benh@...nel.crashing.org,
        brendanhiggins@...gle.com, corbet@....net, dwmw@...zon.com,
        elver@...gle.com, fan.du@...el.com, foersleo@...zon.de,
        greg@...ah.com, gthelen@...gle.com, guoju.fgj@...baba-inc.com,
        jgowans@...zon.com, joe@...ches.com, mgorman@...e.de,
        mheyne@...zon.de, minchan@...nel.org, mingo@...hat.com,
        namhyung@...nel.org, peterz@...radead.org, riel@...riel.com,
        rientjes@...gle.com, rostedt@...dmis.org, rppt@...nel.org,
        shakeelb@...gle.com, shuah@...nel.org, sieberf@...zon.com,
        snu@...le79.org, vbabka@...e.cz, vdavydov.dev@...il.com,
        zgf574564920@...il.com, linux-damon@...zon.com, linux-mm@...ck.org,
        linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v34 05/13] mm/damon: Implement primitives for the virtual
 memory address spaces

On 26.08.21 19:29, SeongJae Park wrote:
> From: SeongJae Park <sjpark@...zon.de>
> 
> Hello David,
> 
> 
> On Thu, 26 Aug 2021 16:09:23 +0200 David Hildenbrand <david@...hat.com> wrote:
> 
>>> +static void damon_va_mkold(struct mm_struct *mm, unsigned long addr)
>>> +{
>>> +	pte_t *pte = NULL;
>>> +	pmd_t *pmd = NULL;
>>> +	spinlock_t *ptl;
>>> +
>>
>> I just stumbled over this, sorry for the dumb questions:
> 
> Thank you for the great questions!
> 
>>
>>
>> a) What do we know about that region we are messing with?
>>
>> AFAIU, just like follow_pte() and follow_pfn(), follow_invalidate_pte()
>> should only be called on VM_IO and raw VM_PFNMAP mappings in general
>> (see the doc of follow_pte()). Do you even know that it's within a
>> single VMA and that there are no concurrent modifications?
> 
> We have no idea about the region at this moment.  However, if we successfully
> get the pte or pmd under the protection of the page table lock, we ensure that
> the page backing the pte or pmd is an online LRU page with damon_get_page(),
> before updating the accessed bit of the pte or pmd.  We release the page table
> lock only after the update.
> 
> Concurrent VMA changes don't matter here because we read and write only the
> page table.  If the address is not mapped or not backed by LRU pages, we
> simply treat it as not accessed.

Reading/writing page tables is the real problem.
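
Concretely, as far as I can tell the flow you describe boils down to 
something like the following -- my own paraphrase of the series, heavily 
simplified and pte-only, not a verbatim quote of the patch; I'm assuming 
damon_get_page() takes a pfn and returns the page with a reference iff 
it is an online LRU page, else NULL:

static void damon_va_mkold(struct mm_struct *mm, unsigned long addr)
{
	pte_t *pte = NULL;
	pmd_t *pmd = NULL;
	spinlock_t *ptl;
	struct page *page;

	/* No mmap lock is held around any of this -- that is the issue. */
	if (follow_invalidate_pte(mm, addr, NULL, &pte, &pmd, &ptl))
		return;		/* not mapped: simply treated as not accessed */

	if (pte) {
		page = damon_get_page(pte_pfn(*pte));
		if (page) {
			/* Clear the accessed bit under the pte lock. */
			*pte = pte_mkold(*pte);
			put_page(page);
		}
		pte_unmap_unlock(pte, ptl);
	} else {
		/* The pmd (huge page) case is handled analogously. */
		spin_unlock(ptl);
	}
}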

> 
>>
>> b) Which locks are we holding?
>>
>> I hope we're holding the mmap lock in read mode at least. Or how are you
>> making sure there are no concurrent modifications to page tables / VMA
>> layout ... ?
>>
>>> +	if (follow_invalidate_pte(mm, addr, NULL, &pte, &pmd, &ptl))
> 
> All the operations are protected by the page table lock of the pte or pmd, so
> no concurrent page table modification would happen.  As previously mentioned,
> because we read and update only the page table, we don't care about VMAs and
> therefore don't need to hold the mmap lock here.

See below, that's unfortunately not sufficient.

> 
> Outside of this function, DAMON reads the VMAs to learn which address ranges
> are not mapped, and uses that information to avoid inefficient access checks
> on unmapped areas.  Nevertheless, that happens only occasionally (once per 60
> seconds by default), and it holds the mmap read lock in that case.
> 
> Nonetheless, I agree the usage of follow_invalidate_pte() here could be
> confusing to readers.  It would be better to implement and use DAMON's own
> page table walk logic.  Of course, I might be missing something important.
> If you think so, please don't hesitate to yell at me.


I'm certainly not going to yell :) But unfortunately I'll have to tell 
you that what you are doing is, in my understanding, fundamentally broken.

See, page tables might get removed at any time:
a) By munmap() code, even while holding the mmap semaphore in read mode (!)
b) By khugepaged, holding the mmap lock in write mode

The rules are (ignoring the rmap side of things):

a) You can walk page tables inside a known VMA with the mmap semaphore 
held in read mode. If you drop the mmap sem, you have to re-validate the 
VMA! Anything could have changed in the meantime. This is essentially 
what mm/pagewalk.c does (see the sketch after these rules).

b) You can walk page tables ignoring VMAs with the mmap semaphore held 
in write mode.

c) You can walk page tables lockless if the architecture supports it and 
you have interrupts disabled the whole time. But you are not allowed to 
write.
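
To illustrate a): here is a minimal sketch -- nothing from your series, 
the names my_pte_entry/my_walk_ops/walk_known_range are made up for the 
example -- of a walk over a known range using mm/pagewalk.c. The mmap 
lock is held in read mode for the whole walk, and the pte is only 
touched from the callback, which pagewalk invokes with the pte lock held:

static int my_pte_entry(pte_t *pte, unsigned long addr,
			unsigned long next, struct mm_walk *walk)
{
	/* Example action: clear the accessed bit if it is set. */
	if (pte_young(*pte))
		*pte = pte_mkold(*pte);
	return 0;
}

static const struct mm_walk_ops my_walk_ops = {
	.pte_entry = my_pte_entry,
};

static void walk_known_range(struct mm_struct *mm, unsigned long start,
			     unsigned long end)
{
	mmap_read_lock(mm);
	/*
	 * VMAs and page tables in [start, end) cannot disappear under us
	 * while we hold the mmap lock in read mode.
	 */
	walk_page_range(mm, start, end, &my_walk_ops, NULL);
	mmap_read_unlock(mm);
}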

With what you're doing, you might end up reading random garbage as page 
table pointers, or writing random garbage to pages that are no longer 
used as page tables.

Take a look at mm/gup.c:lockless_pages_from_mm() to see how difficult it 
is to walk page tables lockless. And it only works because the page table 
freeing code synchronizes, either via an IPI or fake-RCU, before actually 
freeing a page table.
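
Very roughly, that pattern (rule c) looks like this -- heavily 
simplified, the real thing is in lockless_pages_from_mm() and 
gup_pgd_range():

	unsigned long flags;

	/*
	 * Disabling interrupts blocks the IPI that the page table freeing
	 * code waits for (or acts as an RCU-like read-side section with
	 * MMU_GATHER_RCU_TABLE_FREE), so the tables being walked cannot
	 * be freed and reused underneath us.
	 */
	local_irq_save(flags);
	/*
	 * ... walk pgd -> p4d -> pud -> pmd -> pte, reading each entry
	 * exactly once with READ_ONCE() and never writing anything ...
	 */
	local_irq_restore(flags);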

follow_invalidate_pte() is, in general, the wrong thing to use. It's 
specialized to VM_IO and VM_PFNMAP mappings. Take a look at the difference 
in complexity between follow_invalidate_pte() and mm/pagewalk.c!

I'm really sorry, but as far as I can tell, this is locking-wise broken 
and follow_invalidate_pte() is the wrong interface to use here.

Someone can most certainly correct me if I'm wrong, or if I'm missing 
something regarding your implementation, but if you take a look around, 
you won't find any code walking page tables without at least holding the 
mmap sem in read mode -- for a good reason.

-- 
Thanks,

David / dhildenb
