Message-ID: <diqz5y4wfpj0.fsf@ackerleytng-ctop.c.googlers.com>
Date:   Wed, 30 Aug 2023 16:44:51 +0000
From:   Ackerley Tng <ackerleytng@...gle.com>
To:     Binbin Wu <binbin.wu@...ux.intel.com>
Cc:     seanjc@...gle.com, kvm@...r.kernel.org,
        linux-arm-kernel@...ts.infradead.org, kvmarm@...ts.linux.dev,
        linux-mips@...r.kernel.org, linuxppc-dev@...ts.ozlabs.org,
        kvm-riscv@...ts.infradead.org, linux-riscv@...ts.infradead.org,
        linux-fsdevel@...r.kernel.org, linux-mm@...ck.org,
        linux-security-module@...r.kernel.org,
        linux-kernel@...r.kernel.org, pbonzini@...hat.com, maz@...nel.org,
        oliver.upton@...ux.dev, chenhuacai@...nel.org, mpe@...erman.id.au,
        anup@...infault.org, paul.walmsley@...ive.com, palmer@...belt.com,
        aou@...s.berkeley.edu, willy@...radead.org,
        akpm@...ux-foundation.org, paul@...l-moore.com, jmorris@...ei.org,
        serge@...lyn.com, chao.p.peng@...ux.intel.com, tabba@...gle.com,
        jarkko@...nel.org, yu.c.zhang@...ux.intel.com,
        vannapurve@...gle.com, mail@...iej.szmigiero.name, vbabka@...e.cz,
        david@...hat.com, qperret@...gle.com, michael.roth@....com,
        wei.w.wang@...el.com, liam.merwick@...cle.com,
        isaku.yamahata@...il.com, kirill.shutemov@...ux.intel.com
Subject: Re: [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for
 guest-specific backing memory

Binbin Wu <binbin.wu@...ux.intel.com> writes:

>> <snip>
>>
>> +static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
>> +{
>> +	struct address_space *mapping = inode->i_mapping;
>> +	pgoff_t start, index, end;
>> +	int r;
>> +
>> +	/* Dedicated guest is immutable by default. */
>> +	if (offset + len > i_size_read(inode))
>> +		return -EINVAL;
>> +
>> +	filemap_invalidate_lock_shared(mapping);
>> +
>> +	start = offset >> PAGE_SHIFT;
>> +	end = (offset + len) >> PAGE_SHIFT;
>> +
>> +	r = 0;
>> +	for (index = start; index < end; ) {
>> +		struct folio *folio;
>> +
>> +		if (signal_pending(current)) {
>> +			r = -EINTR;
>> +			break;
>> +		}
>> +
>> +		folio = kvm_gmem_get_folio(inode, index);
>> +		if (!folio) {
>> +			r = -ENOMEM;
>> +			break;
>> +		}
>> +
>> +		index = folio_next_index(folio);
>> +
>> +		folio_unlock(folio);
>> +		folio_put(folio);
> Maybe a dumb question, but why do we get the folio and then put it
> immediately? Will that release the folio back to the page allocator?
>

I was wondering this too, but it is correct.

In filemap_grab_folio(), the refcount is incremented in three places:

+ When the folio is created in filemap_alloc_folio(), it is given a
  refcount of 1 in

    filemap_alloc_folio() -> folio_alloc() -> __folio_alloc_node() ->
    __folio_alloc() -> __alloc_pages() -> get_page_from_freelist() ->
    prep_new_page() -> post_alloc_hook() -> set_page_refcounted()

+ Then, in filemap_add_folio(), the refcount is incremented twice:

    + The first is from the filemap (1 refcount per page if this is a
      hugepage):

        filemap_add_folio() -> __filemap_add_folio() -> folio_ref_add()

    + The second is a refcount from the lru list

        filemap_add_folio() -> folio_add_lru() -> folio_get() ->
        folio_ref_inc()

In the other path, if the folio exists in the page cache (filemap), the
refcount is also incremented through

    filemap_grab_folio() -> __filemap_get_folio() -> filemap_get_entry()
    -> folio_try_get_rcu()

I believe all the branches in kvm_gmem_get_folio() take a refcount on the
folio while the kernel does some work on it, such as clearing it in
clear_highpage() or reading the next index, and then drop that refcount
with folio_put() when done.

This pattern is also used in shmem and hugetlb. :)

I'm not sure whose refcount the folio_put() in kvm_gmem_allocate() is
dropping though:

+ The refcount for the filemap depends on whether this is a hugepage or
  not, but folio_put() strictly drops a refcount of 1.
+ The refcount for the lru list is just 1, but doesn't the page still
  remain in the lru list?

>> +
>> +		/* 64-bit only, wrapping the index should be impossible. */
>> +		if (WARN_ON_ONCE(!index))
>> +			break;
>> +
>> +		cond_resched();
>> +	}
>> +
>> +	filemap_invalidate_unlock_shared(mapping);
>> +
>> +	return r;
>> +}
>> +
>>
>> <snip>
