[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZoH9gVVlwMkQO1dm@dread.disaster.area>
Date: Mon, 1 Jul 2024 10:51:13 +1000
From: Dave Chinner <david@...morbit.com>
To: alexjlzheng@...il.com
Cc: chandan.babu@...cle.com, djwong@...nel.org, hch@...radead.org,
linux-kernel@...r.kernel.org, linux-xfs@...r.kernel.org,
alexjlzheng@...cent.com
Subject: Re: [PATCH v3 2/2] xfs: make xfs_log_iovec independent from
xfs_log_vec and free it early
On Wed, Jun 26, 2024 at 12:49:09PM +0800, alexjlzheng@...il.com wrote:
> From: Jinliang Zheng <alexjlzheng@...cent.com>
>
> When the contents of the xfs_log_vec/xfs_log_iovec combination are
> written to iclog, xfs_log_iovec loses its meaning in continuing to exist
> in memory, because iclog already has a copy of its contents. We only
> need to keep xfs_log_vec that takes up very little memory to find the
> xfs_log_item that needs to be added to AIL after we flush the iclog into
> the disk log space.
>
> Because xfs_log_iovec dominates most of the memory in the
> xfs_log_vec/xfs_log_iovec combination, retaining xfs_log_iovec until
> iclog is flushed into the disk log space and releasing together with
> xfs_log_vec is a significant waste of memory.
Have you measured this? Please provide numbers and the workload that
generates them, because when I did this combined structure the
numbers and performance measured came out decisively on the side of
"almost no difference in memory usage, major performance cost to
doing a second allocation"...
Here's the logic - the iovec array is largely "free" with the larger
data allocation.
------
Look at how the heap is structured - it is in power of 2 slab sizes:
$ grep kmalloc /proc/slabinfo |tail -13
kmalloc-8k 949 976 8192 4 8 : tunables 0 0 0 : slabdata 244 244 0
kmalloc-4k 1706 1768 4096 8 8 : tunables 0 0 0 : slabdata 221 221 0
kmalloc-2k 3252 3312 2048 16 8 : tunables 0 0 0 : slabdata 207 207 0
kmalloc-1k 76110 96192 1024 32 8 : tunables 0 0 0 : slabdata 3006 3006 0
kmalloc-512 71753 98656 512 32 4 : tunables 0 0 0 : slabdata 3083 3083 0
kmalloc-256 71006 71520 256 32 2 : tunables 0 0 0 : slabdata 2235 2235 0
kmalloc-192 10304 10458 192 42 2 : tunables 0 0 0 : slabdata 249 249 0
kmalloc-128 8889 9280 128 32 1 : tunables 0 0 0 : slabdata 290 290 0
kmalloc-96 13583 13902 96 42 1 : tunables 0 0 0 : slabdata 331 331 0
kmalloc-64 63116 64640 64 64 1 : tunables 0 0 0 : slabdata 1010 1010 0
kmalloc-32 552726 582272 32 128 1 : tunables 0 0 0 : slabdata 4549 4549 0
kmalloc-16 444768 445440 16 256 1 : tunables 0 0 0 : slabdata 1740 1740 0
kmalloc-8 18178 18432 8 512 1 : tunables 0 0 0 : slabdata 36 36 0
IOws, if we do a 260 byte allocation, we get the same sized memory
chunk as a 512 byte allocation as they come from the same slab
cache.
If we now look at structure sizes - the common ones are buffers
and inodes so we'll look at then.
For an inode, we typically log something like this for an extent
allocation (or free) on mostly contiguous inode (say less than 10
extents)
vec 1: inode log format item
vec 2: inode core
vec 3: inode data fork
Each of these vectors has a 12 byte log op header built into them,
and some padding to round them out to 8 byte alignment.
vec 1: inode log format item: 12 + 56 + 4 (pad)
vec 2: inode core: 12 + 176 + 4 (pad)
vec 3: inode data fork: 12 + 16 (minimum) + 4 (pad)
12 + 336 (maximum for 512 byte inode)
If we are just logging the inode core, we are allocating
12 + 56 + 4 + 12 + 176 + 4 = 264 bytes.
It should be obvious now that this must be allocated from the 512
byte slab, and that means we have another 248 bytes of unused space
in that allocated region we can actually use -for free-.
IOWs, the fact that we add 32 bytes for the 2 iovecs for to index
this inode log item doesn't matter at all - it's free space on the
heap. Indeed, it's not until the inode data fork gets to a couple of
hundred bytes in length that we overflow the 512 byte slab and have
to use the 1kB slab. Again, we get the iovec array space for free.
If we are logging the entire inode with the data fork, then the
size of the data being logged is 264 + 12 + 336 + 4 = 616 bytes.
This is well over the 512 byte slab, so we are always going to be
allocating from the 1kB slab. We get the iovec array for free the
moment we go over the 512 byte threshold again.
IOWs, all the separation of the iovec array does is slightly change
the data/attr fork size thresholds where we go from using the 512
byte slab to the 1kB slab.
A similar pattern holds out for the buffer log items. The minimum
it will be is:
vec 1: buf log format item
vec 2: single 128 byte chunk
This requires 12 + 40B + 4 + 12 + 128B + 4 = 200 bytes. For two
vectors, we need 32 bytes for the iovec array, so a total of 232
bytes is needed, and this will fit in a 256 byte slab with or
without the iovec array attached.
The same situation occurs are we increase the number of logged
regions or the size of the logged regions - in almost all cases we
get the iovec array for free because we log 128 byte regions out of
buffers and they will put us into the next largest size slab
regardless of the memory used by the iovec array.
Hence we should almost always get the space for the iovec array for
free from the slab allocator, and separating it out doesn't actually
reduce slab cache memory usage. If anything, it increases it,
because now we are allocating the iovec array out of small slabs and
so instead of it being "free" the memory usage is now accounted to
smaller slabs...
-----
Hence before we go any further with this patch set, I'd like to see
see numbers that quantify how much extra memory the embedded iovec
array is actually costing us. And from that, an explanation of why
the above "iovec array space should be cost-free" logic isn't
working as intended....
-Dave.
--
Dave Chinner
david@...morbit.com
Powered by blists - more mailing lists