linux-kernel - Re: [PATCH v3 2/2] xfs: make xfs_log_iovec independent from xfs_log

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZoH9gVVlwMkQO1dm@dread.disaster.area>
Date: Mon, 1 Jul 2024 10:51:13 +1000
From: Dave Chinner <david@...morbit.com>
To: alexjlzheng@...il.com
Cc: chandan.babu@...cle.com, djwong@...nel.org, hch@...radead.org,
	linux-kernel@...r.kernel.org, linux-xfs@...r.kernel.org,
	alexjlzheng@...cent.com
Subject: Re: [PATCH v3 2/2] xfs: make xfs_log_iovec independent from
 xfs_log_vec and free it early

On Wed, Jun 26, 2024 at 12:49:09PM +0800, alexjlzheng@...il.com wrote:
> From: Jinliang Zheng <alexjlzheng@...cent.com>
> 
> When the contents of the xfs_log_vec/xfs_log_iovec combination are
> written to iclog, xfs_log_iovec loses its meaning in continuing to exist
> in memory, because iclog already has a copy of its contents. We only
> need to keep xfs_log_vec that takes up very little memory to find the
> xfs_log_item that needs to be added to AIL after we flush the iclog into
> the disk log space.
> 
> Because xfs_log_iovec dominates most of the memory in the
> xfs_log_vec/xfs_log_iovec combination, retaining xfs_log_iovec until
> iclog is flushed into the disk log space and releasing together with
> xfs_log_vec is a significant waste of memory.

Have you measured this? Please provide numbers and the workload that
generates them, because when I did this combined structure the
numbers and performance measured came out decisively on the side of
"almost no difference in memory usage, major performance cost to
doing a second allocation"...

Here's the logic - the iovec array is largely "free" with the larger
data allocation.

------

Look at how the heap is structured - it is in power of 2 slab sizes:

$ grep kmalloc /proc/slabinfo |tail -13
kmalloc-8k           949    976   8192    4    8 : tunables    0    0    0 : slabdata    244    244      0
kmalloc-4k          1706   1768   4096    8    8 : tunables    0    0    0 : slabdata    221    221      0
kmalloc-2k          3252   3312   2048   16    8 : tunables    0    0    0 : slabdata    207    207      0
kmalloc-1k         76110  96192   1024   32    8 : tunables    0    0    0 : slabdata   3006   3006      0
kmalloc-512        71753  98656    512   32    4 : tunables    0    0    0 : slabdata   3083   3083      0
kmalloc-256        71006  71520    256   32    2 : tunables    0    0    0 : slabdata   2235   2235      0
kmalloc-192        10304  10458    192   42    2 : tunables    0    0    0 : slabdata    249    249      0
kmalloc-128         8889   9280    128   32    1 : tunables    0    0    0 : slabdata    290    290      0
kmalloc-96         13583  13902     96   42    1 : tunables    0    0    0 : slabdata    331    331      0
kmalloc-64         63116  64640     64   64    1 : tunables    0    0    0 : slabdata   1010   1010      0
kmalloc-32        552726 582272     32  128    1 : tunables    0    0    0 : slabdata   4549   4549      0
kmalloc-16        444768 445440     16  256    1 : tunables    0    0    0 : slabdata   1740   1740      0
kmalloc-8          18178  18432      8  512    1 : tunables    0    0    0 : slabdata     36     36      0

IOws, if we do a 260 byte allocation, we get the same sized memory
chunk as a 512 byte allocation as they come from the same slab
cache.

If we now look at structure sizes - the common ones are buffers
and inodes so we'll look at then.

For an inode, we typically log something like this for an extent
allocation (or free) on mostly contiguous inode (say less than 10
extents)

vec 1:	inode log format item
vec 2:	inode core
vec 3:	inode data fork

Each of these vectors has a 12 byte log op header built into them,
and some padding to round them out to 8 byte alignment.

vec 1:	inode log format item:	12 + 56 + 4 (pad)
vec 2:	inode core:		12 + 176 + 4 (pad)
vec 3:	inode data fork:	12 + 16 (minimum) + 4 (pad)
				12 + 336 (maximum for 512 byte inode)

If we are just logging the inode core, we are allocating
12 + 56 + 4 + 12 + 176 + 4 = 264 bytes.

It should be obvious now that this must be allocated from the 512
byte slab, and that means we have another 248 bytes of unused space
in that allocated region we can actually use -for free-.

IOWs, the fact that we add 32 bytes for the 2 iovecs for to index
this inode log item doesn't matter at all - it's free space on the
heap. Indeed, it's not until the inode data fork gets to a couple of
hundred bytes in length that we overflow the 512 byte slab and have
to use the 1kB slab. Again, we get the iovec array space for free.

If we are logging the entire inode with the data fork, then the
size of the data being logged is 264 + 12 + 336 + 4 = 616 bytes.
This is well over the 512 byte slab, so we are always going to be
allocating from the 1kB slab. We get the iovec array for free the
moment we go over the 512 byte threshold again.

IOWs, all the separation of the iovec array does is slightly change
the data/attr fork size thresholds where we go from using the 512
byte slab to the 1kB slab.

A similar pattern holds out for the buffer log items.  The minimum
it will be is:

vec 1:	buf log format item
vec 2:	single 128 byte chunk

This requires 12 + 40B + 4 + 12 + 128B + 4 = 200 bytes. For two
vectors, we need 32 bytes for the iovec array, so a total of 232
bytes is needed, and this will fit in a 256 byte slab with or
without the iovec array attached.

The same situation occurs are we increase the number of logged
regions or the size of the logged regions - in almost all cases we
get the iovec array for free because we log 128 byte regions out of
buffers and they will put us into the next largest size slab
regardless of the memory used by the iovec array.

Hence we should almost always get the space for the iovec array for
free from the slab allocator, and separating it out doesn't actually
reduce slab cache memory usage. If anything, it increases it,
because now we are allocating the iovec array out of small slabs and
so instead of it being "free" the memory usage is now accounted to
smaller slabs...

-----

Hence before we go any further with this patch set, I'd like to see
see numbers that quantify how much extra memory the embedded iovec
array is actually costing us. And from that, an explanation of why
the above "iovec array space should be cost-free" logic isn't
working as intended....

-Dave.
-- 
Dave Chinner
david@...morbit.com