Message-ID: <YvjVNBBgLxEy4xbQ@casper.infradead.org>
Date:   Sun, 14 Aug 2022 11:57:56 +0100
From:   Matthew Wilcox <willy@...radead.org>
To:     Mike Rapoport <rppt@...nel.org>
Cc:     linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        linux-fsdevel@...r.kernel.org
Subject: Re: State of the Page (August 2022)

On Sat, Aug 13, 2022 at 06:21:12PM +0300, Mike Rapoport wrote:
> > For some users, the size of struct page is simply too large.  At 64
> > bytes per 4KiB page, memmap occupies 1.6% of memory.  If we can get
> > struct page down to an 8 byte tagged pointer, it will be 0.2% of memory,
> > which is an acceptable overhead.
> > 
> >    struct page {
> >       unsigned long mem_desc;
> >    };
> 
> This is 0.2% for a system that does not have any actual memdescs.
> 
> Do you have an estimate how much memory will be used by the memdescs, at
> least for some use-cases?

Sure.  For SLUB, we can see it today:

struct slab {
        unsigned long __page_flags;
        union {
                struct list_head slab_list;
                struct rcu_head rcu_head;
        };
        struct kmem_cache *slab_cache;
        /* Double-word boundary */
        void *freelist;         /* first free object */
        union {
                unsigned long counters;
                struct {
                        unsigned inuse:16;
                        unsigned objects:15;
                        unsigned frozen:1;
                };
        };
        unsigned int __unused;
        atomic_t __page_refcount;
#ifdef CONFIG_MEMCG
        unsigned long memcg_data;
#endif
};

That's 8 words on 64-bit, or 64 bytes.  We'll get to remove __unused
and __page_refcount, which brings us down to 56 bytes, but we'll need
to add a pointer to struct page, bringing us back up to 64 bytes.  Note
that this is per-allocation, so to calculate the amount of space used
on your system, you need to look at each line of /proc/slabinfo, like
this one:

radix_tree_node   189800 278348    584   28    4 : tunables    0    0    0 : slabdata   9941   9941      0

The last number before the first colon is the number of pages per
slab, and the numbers after "slabdata" are the slab counts, so my
system currently has 9941 radix_tree_node slabs allocated, each with 4
pages in it.  Current memory consumption is 64 * 4 * 9941 = ~2.5MB.
With separately allocated memdescs, it's 8 * 4 * 9941 + 64 * 9941, or
just under 1MB.  You'd need to repeat this calculation for each line of
slabinfo.
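
A quick sketch of that sum over the whole of slabinfo might look like
this.  It's purely illustrative: it hard-codes the sizes discussed
above (a 64-byte struct page today versus an 8-byte struct page plus a
64-byte slab memdesc), assumes the slabinfo 2.1 line format, and needs
enough privilege to read /proc/slabinfo.

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/slabinfo", "r");
        char line[512], name[64];
        unsigned long active_objs, num_objs, objsize, objperslab;
        unsigned long pagesperslab, limit, batch, shared;
        unsigned long active_slabs, num_slabs, sharedavail;
        unsigned long long now = 0, with_memdescs = 0;

        if (!f) {
                perror("fopen");
                return 1;
        }

        while (fgets(line, sizeof(line), f)) {
                if (sscanf(line, "%63s %lu %lu %lu %lu %lu : tunables "
                           "%lu %lu %lu : slabdata %lu %lu %lu",
                           name, &active_objs, &num_objs, &objsize,
                           &objperslab, &pagesperslab, &limit, &batch,
                           &shared, &active_slabs, &num_slabs,
                           &sharedavail) != 12)
                        continue;       /* skip the header lines */

                /* today: one 64-byte struct page per page in the slab */
                now += 64ULL * pagesperslab * num_slabs;
                /* proposed: 8 bytes per page plus a 64-byte slab memdesc */
                with_memdescs += (8ULL * pagesperslab + 64) * num_slabs;
        }
        fclose(f);

        printf("struct page today: %llu bytes\n", now);
        printf("with memdescs:     %llu bytes\n", with_memdescs);
        return 0;
}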


For other users, it depends on how they evolve.  In my quick sketch, I
decided that adding the pfn to struct folio was a good idea, but adding
a pointer to the page wasn't needed (for the few times it's needed,
we can call pfn_to_page()).  So struct folio will grow from 64 bytes
to 72 in order to add the pfn.  We'll also need to include the size of
the subsequent fields currently stored in page[1] (dtor, order,
total_mapcount and pincount), bumping large folios up to 88 bytes.
If the mean size of a folio is 2 pages, then it's 88 + 2 * 8 = 104
bytes per allocation instead of the current 2 * 64 = 128 bytes.  So
it's still a win, as long as we don't cache a lot of files smaller
than 4kB.
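
As a rough sketch of that arithmetic (the helper name and the 72/88
byte sizes are just the assumptions above, not a real API):

#include <stdio.h>

/*
 * Hypothetical: a 72-byte folio memdesc (88 bytes for large folios once
 * the page[1] fields move in), plus one 8-byte struct page per
 * constituent page.
 */
static unsigned long folio_memdesc_bytes(unsigned long nr_pages)
{
        return (nr_pages == 1 ? 72 : 88) + nr_pages * 8;
}

int main(void)
{
        /* a 2-page folio: 104 bytes with memdescs vs 2 * 64 = 128 today */
        printf("2-page folio: %lu vs %lu bytes\n",
               folio_memdesc_bytes(2), 2 * 64UL);
        return 0;
}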

> Another thing: we are very strict about keeping struct page at its current
> size. Don't you think it will be much more tempting to grow any of the
> memdescs, and that for some use cases the overhead will end up at least as
> big as it is now?

Possibly!  But we get to make that choice.  If the networking people want
to grow the size of the netpool memdesc, you and I don't need to care.
They don't need to negotiate with the MM people about the tradeoffs
involved; they can just do it, benchmark, and decide whether it makes
sense to them.

This is more of an opportunity than a potential downside.  Maybe we can
get rid of page_ext.  Yes, people who enable the features in page_ext
will see their memdescs grow, but they've got rid of the page_ext array
in the process.

Thanks for the feedback.
