[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAFZ0FUUGpkVCmRuLN-iQ2N73D3wN+0Y6k_3TA4FzjVow=P447A@mail.gmail.com>
Date: Wed, 15 Feb 2012 11:19:33 +0800
From: Robin Dong <hao.bigrat@...il.com>
To: Aditya Kali <adityakali@...gle.com>
Cc: "Ted Ts'o" <tytso@....edu>, linux-ext4@...r.kernel.org
Subject: Re: [PATCH] ext4: s_dirtyclusters_counter should tranform to unit of
cluster before assigning to "dirty_clusters" in ext4_has_free_clusters()
在 2012年2月15日 上午5:24,Aditya Kali <adityakali@...gle.com> 写道:
> Hi Robin,
>
>
> On Tue, Feb 14, 2012 at 12:01 AM, Robin Dong <hao.bigrat@...il.com> wrote:
>>
>> 在 2012年2月10日 上午10:55,Ted Ts'o <tytso@....edu> 写道:
>> > On Thu, Feb 02, 2012 at 03:48:49PM +0800, Robin Dong wrote:
>> >>
>> >> The reason is ext4_has_free_clusters reporting wrong
>> >> result. Actually, the unit of argument "dirty_clusters" is block, so
>> >> we should tranform the sbi->s_dirtyclusters_counter to block , just
>> >> like "free_clusters".
>> >
>> > I've been looking at the the delayed allocation reservation code, and
>> > I think it's a lot more confused that this. Some places we are
>> > playing with fields as if they are clusters, and some times as if they
>> > are blocks, and I've definitely found places where we're not handling
>> > quota correctly with bigalloc (ext4_map_blocks is calling
>> > ext4_da_update_reserve_space() with a value which is clearly in units
>> > of blocks, but we are treating that vale as if it is clusters, etc.)
>> >
>> > So I think the problem is a lot larger than what you pointed out, and
>> > we shouldn't fix it just by throwing in an EXT4_C2B call. If
>> > s_dirtyclusters_counter should be denominated in blocks, then we need
>> > to rename it. And there there's this comment which made me wince:
>> >
>> > * Note that in case of bigalloc, i_reserved_meta_blocks,
>> > * i_reserved_data_blocks, etc. refer to number of clusters.
>> >
>> > Argh, no. If something is denominated in clusters, then it should be
>> > appropriately named as such.
>> >
>> > Actually, I think the problem is a lot bigger than this, and it's a
>> > conceptual one. When we try writing to a block which does not have a
>> > mapping, and bigalloc is enabled, there are three cases:
>> >
>> > 1) This is the first time the cluster has ever been written to; hence,
>> > even though we are dirtying a block, we need to reserve room for the
>> > entire cluster.
>> >
>> > 2) While the block has not been mapped (and so the data on disk is
>> > uninitialized) the cluster has indeed been allocated, so we don't need
>> > to reserve any additional room for the block.
>> >
>> > 3) One or more blocks in the cluster have already been written to via
>> > delayed allocation, but the cluster itself has not been selected
>> > (allocated) yet. But since the space for the entire cluster was
>> > reserved when the first block in the cluster was dirtied (case #1), we
>> > don't need to reserve any more space.
>> >
>> > We aren't handling these cases correctly today, since we don't have
>> > this logic here. These cases don't matter if you're only using
>> > fallocate() and direct I/O, but if you are using buffered writes and
>> > delayed allocation, they obviously matter a huge amount.
>> >
>>
>> > And I don't think we're handling it correctly. Argh...
>> >
>
> I thought we do handle these cases correctly after the 'ext4: attempt
> to fix race in bigalloc code path' change. May be I missed something
> or the code changed somewhere. If there is a testcase that fails, I
> can take a look.
>
>> > - Ted
>>
>> Hi, Ted and Aditya
>>
>> I am trying to fix this "bigalloc quota reservation in
>> delay-allocation" problem recently, and I feel confusion about these
>> code below in ext4_ext_map_blocks():
>>
>> if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
>> unsigned int reserved_clusters;
>> /*
>> * Check how many clusters we had reserved this allocated range
>> */
>> reserved_clusters = get_reserved_cluster_alloc(inode,
>> map->m_lblk, allocated);
>> if (map->m_flags & EXT4_MAP_FROM_CLUSTER) {
>> if (reserved_clusters) {
>> ......
>> ext4_da_update_reserve_space(inode,
>> reserved_clusters, 0);
>> }
>> } else {
>> BUG_ON(allocated_clusters < reserved_clusters);
>> /* We will claim quota for all newly allocated blocks.*/
>> ext4_da_update_reserve_space(inode, allocated_clusters,
>> 1);
>> if (reserved_clusters < allocated_clusters) {
>> struct ext4_inode_info *ei = EXT4_I(inode);
>> int reservation = allocated_clusters -
>> reserved_clusters;
>> dquot_reserve_block(inode,
>> EXT4_C2B(sbi, reservation));
>> spin_lock(&ei->i_block_reservation_lock);
>> ei->i_reserved_data_blocks += reservation;
>> spin_unlock(&ei->i_block_reservation_lock);
>> }
>> }
>> }
>>
>> if we have allocated 3 clusters, why should we add "reservation" back
>> to i_reservd_data_blocks ?
> Because we will end up here again next time when we write blocks
> [10-11] and try to claim space for them. At this time, there will be
> no more reserved blocks left (unless you give them back) and you will
> hit the warning in ext4_da_update_reserve_space():
>
> if (unlikely(used > ei->i_reserved_data_blocks)) {
> ext4_msg(inode->i_sb, KERN_NOTICE, "%s: ino %lu, used %d "
> "with only %d reserved data blocks\n",
> __func__, inode->i_ino, used,
> ei->i_reserved_data_blocks);
> WARN_ON(1);
> used = ei->i_reserved_data_blocks;
> }
>
>
>> (Aditya Kali 's long comment dose not make
>> sense to me) Maybe we don't need to do ext4_da_update_reserve_space
>> when the new block is allocated from "implied_cluster"? Why not just
>> change them to:
>>
>> if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
>> if ( !(map->m_flags & EXT4_MAP_FROM_CLUSTER) )
>> /* We will claim quota for all newly allocated blocks.*/
>> ext4_da_update_reserve_space(inode, allocated_clusters,
>> 1);
>> }
>>
>
> This won't work in all cases, like the one given in the long comment.
> With above change, ext4_da_update_reserve_space() will get called
> again the next time for some delayed allocated blocks in the same
> cluster, but there isn't any reserved blocks left to claim.
> The flag 'EXT4_MAP_FROM_CLUSTER' is not a sufficient check to
> determine 'how many' clusters we need to claim. This is because
> get_implied_cluster_alloc() only checks within already allocated
> extents (and not in the page cache). So, I had to add the function
> get_reserved_cluster_alloc() to check if there are any delayed
> allocated blocks still in the page-cache.
But I found the code in ext4_ext_map_blocks():
if ((sbi->s_cluster_ratio > 1) &&
ext4_find_delalloc_cluster(inode, map->m_lblk, 0))
map->m_flags |= EXT4_MAP_FROM_CLUSTER;
thus, even the delayed allocated block in the page-cache (have not be
allocated) already have been checked. Therefore in my opinion, the
EXT4_MAP_FROM_CLUSTER is a sufficient check.
Follow the case of your long comment :) , process will go like this
in MY CODE ABOVE :
[0-3], [4-7], [8-11]
1. delay-allocation write blocks 10&11
we reserve 1 cluster, the i_reserved_data_blocks is 1 now
2. delay-allocation write blocks 3 to 8
we reserve other 2 clusters, so 3 clusters totally, the
i_reserved_data_blocks is 3 now
3. writeout the blocks 3 to 8
claim all 3 clusters, i_reserved_data_blocks is 0 now
At this moment, we really have allocated 3 clusters, and the
remain 10&11 block would never occupy another cluster (it would
definitely go into the [8-11] cluster), so, why need we reserve one
more cluster quota ?
4. writeout the 10&11 blocks
the 10&11 blocks will be set EXT4_MAP_FROM_CLUSTER flag ( by
get_implied_cluster_alloc as the 8~9 block have been allocated), and
it will not call ext4_da_update_reserve_space any more
As this, no warning, no reserve-and-release ping pong....at least in my imaging
Could you please point out what I missed ?
Thanks
> You will also miss the other case (when EXT4_MAP_FROM_CLUSTER is set)
> where we had reserved the cluster, but ended up not allocating it.
> Since this cluster was not allocated, we need to update the reserved
> space without claiming quota (ext4_da_update_reserve_space(inode,
> reserved_clusters, 0) --- notice the last argument is '0').
>
> I know its kinda messy, but I think it works. Also, since we don't
> keep track of delayed extents, its harder to check in one go.
>
> As for your proposed fix earlier, the correct solution will be to not
> convert 'free_clusters' value into blocks.
>
> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index f9e2cd8..2c518f8 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -426,7 +426,7 @@ static int ext4_has_free_clusters(struct ext4_sb_info *sbi,
>
> if (free_clusters - (nclusters + root_clusters + dirty_clusters) <
> EXT4_FREECLUSTERS_WATERMARK) {
> - free_clusters = EXT4_C2B(sbi,
> percpu_counter_sum_positive(fcc));
> + free_clusters = percpu_counter_sum_positive(fcc);
> dirty_clusters = percpu_counter_sum_positive(dcc);
> }
> /* Check whether we have space after accounting for current
>
>
> We were incorrectly converting clusters to blocks and so estimating
> that we have more free clusters available than what we actually do.
> Hence the dmesg errors at write time.
>
>>
>> --
>> --
>> Best Regard
>> Robin Dong
>
>
> Thanks,
> Aditya
--
--
Best Regard
Robin Dong
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists