Message-Id: <0D534283-4B81-45CA-B7D2-E2857F0EAB85@dilger.ca>
Date:	Mon, 8 Jul 2013 15:27:21 -0600
From:	Andreas Dilger <adilger@...ger.ca>
To:	Jan Kara <jack@...e.cz>
Cc:	Lukáš Czerner <lczerner@...hat.com>,
	Alex Zhuravlev <alexey.zhuravlev@...el.com>,
	"linux-ext4@...r.kernel.org List" <linux-ext4@...r.kernel.org>
Subject: Re: [PATCH v2] ext4: Try to better reuse recently freed space

On 2013-07-08, at 5:59 AM, Jan Kara wrote:
> On Mon 08-07-13 11:24:01, Lukáš Czerner wrote:
>> On Mon, 8 Jul 2013, Jan Kara wrote:
>>> On Mon 08-07-13 09:38:27, Lukas Czerner wrote:
>>>> Currently, if the block allocator cannot allocate at the goal,
>>>> we use the global goal for stream allocation. However, the
>>>> global goal (s_mb_last_group and s_mb_last_start) moves further
>>>> every time such an allocation happens and never moves backwards.
>>>> 
>>>> This causes several problems in certain scenarios:
>>>> - the goal will move further and further preventing us from
>>>>   reusing space which might have been freed since then. This
>>>>   is ok from the file system point of view because we will
>>>>   reuse that space eventually, however we're allocating blocks
>>>>   from slower parts of the spinning disk even though it might
>>>>   not be necessary.
>>>> - The above also causes a more serious problem, for example for
>>>>   thinly provisioned storage (sparse images backed storage
>>>>   as well), because instead of reusing blocks which are
>>>>   already provisioned we would try to use new blocks. This
>>>>   would unnecessarily drain storage free blocks pool.
>>>> - This will also cause blocks to be allocated further from
>>>>   the given goal than is necessary. Consider for example
>>>>   truncating, or removing and rewriting, a file in a loop.
>>>>   This workload will never reuse freed blocks until we have
>>>>   claimed and freed all the blocks in the file system.
>>>> 
>>>> Note that file systems like xfs, ext3, or btrfs do not have this
>>>> problem. This is simply caused by the notion of a global pool.
>>>> 
>>>> Fix this by changing the global goal to be a per-inode goal. This
>>>> will allow us to invalidate the goal every time the inode has
>>>> been truncated, or newly created, so in those cases we would try
>>>> to use the proper more specific goal which is based on inode
>>>> position.
>>> When looking at your patch for the second time, I started wondering,
>>> whether we need per-inode stream goal at all. We already do set
>>> goal in the allocation request for mballoc (ar->goal) e.g. in
>>> ext4_ext_find_goal().
>>> It seems strange to then reset it inside mballoc and I don't
>>> even think mballoc will change it to something else now when
>>> the goal is per-inode and not global.
>> 
>> Yes, we do set the goal in the allocation request and it is supposed
>> to be the "best" goal. However sometimes it can not be fulfilled
>> because we do not have any free block at "goal".
>> 
>> That's when the global (or per-inode) goal comes into play. I suppose
>> that there were several reasons for that. First of all it makes it
>> easier for the allocator, because it can directly jump to the point
>> where we allocated last time and it is likely that there is some
>> free space to allocate from - so the benefit is that we do not have
>> to walk all the space in between which is likely to be allocated.
> 
> Yep, but my question is: If we have per-inode streaming goal, can
> you show an example when the "best" goal will be different from the
> "streaming" goal? Because from a (I admit rather quick) look at how
> each of these is computed, it seems that both will point after the
> next allocated block in case of streaming IO.

There were a few goals with the design of the mballoc allocator:
- keep large allocations sized and aligned on RAID chunks
- pack small allocations together, so they can fit into a
  single RAID chunk and avoid many read-modify-write cycles
- keep allocations relatively close together, so that seeking is
  minimized and it doesn't search through fragmented free space
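
As a rough illustration of the first two goals (a sketch only, not the
mballoc code; `align_request` and its block-based parameters are
made-up names for this example), a large request can be rounded out to
chunk boundaries while a small one is left alone so that it can be
packed next to its neighbours:

```python
def align_request(start, length, chunk):
    """Round a large request out to RAID-chunk boundaries.

    All values are in filesystem blocks. This is a hypothetical helper
    for illustration, not a function from ext4.
    """
    if length < chunk:
        # Small requests are not aligned; the idea is to pack them
        # together inside one chunk instead.
        return start, length
    aligned_start = (start // chunk) * chunk   # round start down
    end = start + length
    aligned_end = -(-end // chunk) * chunk     # round end up
    return aligned_start, aligned_end - aligned_start
```

With a 64-block chunk, a 100-block request starting at block 70
becomes a 128-block request starting at block 64, while a 3-block
request is passed through unchanged.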

It was designed for high-bandwidth streaming IO, and not small random
IO as is seen with VM images and thin-provisioned storage.  That said,
with TRIM and sparse storage, there is no guarantee that an offset
within the virtual device has any association with the real disk, so
I'm not sure it makes sense to optimize for this case.

The ext4 code will already bias new allocations towards the beginning
of the disk because this is how inodes are allocated and this
influences each inode's block allocation.  The global target allows
the large allocations to get out of the mess of fragmented free space
and into other parts of the filesystem that are not as heavily used.
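
For reference, the forward-only behaviour of the global goal that the
patch description complains about can be modelled with a toy
allocator.  This is a sketch under simplifying assumptions (one flat
block array, a single cursor standing in for s_mb_last_group and
s_mb_last_start, hypothetical class and function names), not how
mballoc is implemented:

```python
class ToyStreamAllocator:
    """Toy model: one global cursor that only moves forward (with wrap)."""

    def __init__(self, nblocks):
        self.used = [False] * nblocks
        self.cursor = 0  # stands in for the global stream goal

    def alloc(self, count):
        n = len(self.used)
        for off in range(n):
            start = (self.cursor + off) % n
            end = start + count
            if end > n:
                continue  # extent would run past the end of the device
            if not any(self.used[start:end]):
                for b in range(start, end):
                    self.used[b] = True
                self.cursor = end % n  # goal advances past this extent
                return start
        raise MemoryError("no free extent of the requested size")

    def free(self, start, count):
        # Freeing does not move the cursor back.
        for b in range(start, start + count):
            self.used[b] = False


def rewrite_loop(nblocks, count, iterations):
    """Allocate and immediately free the same file, recording the starts."""
    allocator = ToyStreamAllocator(nblocks)
    starts = []
    for _ in range(iterations):
        s = allocator.alloc(count)
        starts.append(s)
        allocator.free(s, count)
    return starts
```

In this model `rewrite_loop(16, 4, 5)` returns starts 0, 4, 8, 12 and
only then 0 again: the freed blocks are not reused until the cursor
has swept the whole device.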

Rather than tweaking the global target (which is currently just a
simple round-robin), it probably makes sense to look at a global
extent map for free space, so that it is easier to find large chunks
of free space.  It is still desirable to keep these allocations closer
together, otherwise the "optimum" location for several large
allocations may be on different ends of the disk and the situation
would be far worse than making periodic writes over the whole disk.
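
To make the idea a bit more concrete, here is a minimal sketch of such
a lookup, assuming free space is available as a list of (start,
length) extents in blocks.  The function name and the "closest to a
hint" policy are assumptions for illustration only; they are not taken
from ext4:

```python
def pick_extent(free_extents, want, hint):
    """Pick a free extent of at least `want` blocks, closest to `hint`.

    free_extents: iterable of (start, length) tuples, in blocks.
    Returns the chosen (start, length) tuple, or None if nothing fits.
    Hypothetical helper; a real implementation would use an indexed
    structure rather than a linear scan.
    """
    candidates = [(s, l) for s, l in free_extents if l >= want]
    if not candidates:
        return None
    # Prefer the extent whose start is nearest the hint, so that
    # successive large allocations stay close together.
    return min(candidates, key=lambda e: abs(e[0] - hint))
```

Passing the end of the previous large allocation as `hint` keeps
consecutive large allocations near each other instead of scattering
them to whichever end of the disk has the biggest hole.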

Cheers, Andreas