linux-ext4 - Re: ENOSPC returned during writepages

Message-Id: <1219268135.7895.30.camel@mingming-laptop>
Date:	Wed, 20 Aug 2008 14:35:35 -0700
From:	Mingming Cao <cmm@...ibm.com>
To:	"Aneesh Kumar K.V" <aneesh.kumar@...ux.vnet.ibm.com>
Cc:	Theodore Tso <tytso@....edu>,
	ext4 development <linux-ext4@...r.kernel.org>
Subject: Re: ENOSPC returned during writepages


On Wed, 2008-08-20 at 23:57 +0530, Aneesh Kumar K.V wrote:
> On Wed, Aug 20, 2008 at 07:53:31AM -0400, Theodore Tso wrote:
> > On Wed, Aug 20, 2008 at 04:16:44PM +0530, Aneesh Kumar K.V wrote:
> > > > mpage_da_map_blocks block allocation failed for inode 323784 at logical
> > > > offset 313 with max blocks 11 with error -28
> > > > This should not happen.!! Data will be lost
> > 
> > We don't actually lose the data if free blocks are subsequently made
> > available, correct?
> > 
> > > I tried this patch. There are still multiple ways we can get wrong free
> > > block count. The patch reduced the number of errors. So we are doing
> > > better with patch. But I guess we can't use the percpu_counter based
> > > free block accounting with delalloc. Without delalloc it is ok even if
> > > we find some wrong free blocks count. The actual block allocation will fail in
> > > that case and we handle it perfectly fine. With delalloc we cannot
> > > afford to fail the block allocation. Should we look at a free block
> > > accounting rewrite using simple ext4_fsblk_t and a spin lock ?
> > 
> > It would be a shame if we did given that the whole point of the percpu
> > counter was to avoid a scalability bottleneck.  Perhaps we could take
> > a filesystem-level spinlock only when the number of free blocks as
> > reported by the percpu_counter falls below some critical level?
> > 
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -1543,7 +1543,14 @@ static int ext4_da_reserve_space(struct inode *inode, int nrblocks)
> > >  	}
> > >  	/* reduce fs free blocks counter */
> > >  	percpu_counter_sub(&sbi->s_freeblocks_counter, total);
> > > -
> > > +	/*
> > > +	 * Now check whether the block count has gone negative.
> > > +	 * Some other CPU could have reserved blocks in between
> > > +	 */
> > > +	if (percpu_counter_read(&sbi->s_freeblocks_counter) < 0) {
> > > +		spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
> > > +		return -ENOSPC;
> > > +	}
> > 
> > 
> > I think you want to do the check before calling percpu_counter_sub();
> > otherwise when you return ENOSPC the free blocks counter ends up
> > getting reduced (and gets left negative).
> > 
> > Also, this is one of the places where it might help if we did
> > something like:
> > 
> > 	freeblocks = percpu_counter_read(&sbi->s_freeblocks_counter);
> > 	if (freeblocks < NR_CPUS*4)
> > 		freeblocks = percpu_counter_sum(&sbi->s_freeblocks_counter);
> > 
> > 	if (freeblocks < total) {
> > 		spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
> > 		return -ENOSPC;
> > 	}
> > 
> > BTW, I was looking at the percpu_counter interface, and I'm confused
> > why we have percpu_counter_sum_and_set() and percpu_counter_sum().  If
> > we're taking the fbc->lock to calculate the precise value of the
> > counter, why not simply set fbc->count?  
> > 
> > Also, it is singularly unfortunate that certain interfaces, such as
> > percpu_counter_sum_and_set() only exist for CONFIG_SMP.  This is
> > definitely post-2.6.27, but it seems to me that we probably want
> > something like percpu_counter_compare_lt() which does something like this:
> > 
> > static inline int percpu_counter_compare_lt(struct percpu_counter *fbc,
> > 					    s64 amount)
> > {
> > #ifdef CONFIG_SMP
> > 	if ((fbc->count - amount) < FBC_BATCH)
> > 		percpu_counter_sum_and_set(fbc);
> > #endif
> > 	return 	(fbc->count < amount);
> > }
> > 
> > ... which we would then use in ext4_has_free_blocks() and
> > ext4_da_reserve_space().
> > 
> 
> Let's say FBC_BATCH = 64 and fbc->count = 100 and we have four cpus and
> each cpu request for 30 blocks. each CPU does
> 

But ext4_da_reserve_space() is called at prepare_write/write_begin
time for each page to be written, so each CPU requests at most 1 block
at a time; it is not possible to reserve 30 blocks in one call.

> in ext4_has_free_blocks:
> free_blocks - nblocks = 100 - 30 = 70 and is > FBC_BATCH So we don't do

free_blocks is not necessarily 100.

free_blocks comes from percpu_counter_read_positive(), which reads the
local cpu counter. In your example, if the global counter is 100 but the
local cpu counter is 0, then you will get free_blocks = 0 here. With
nblocks = 1, you then get

free_blocks - nblocks = 0 - 1 = -1, which is below FBC_BATCH, so we call
percpu_counter_sum_and_set() to get a more accurate value.
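
A rough userspace sketch of that fallback may make the arithmetic easier
to follow. It is not the kernel code: struct model_counter, the model_*
helpers, MODEL_NCPUS and MODEL_BATCH are hypothetical names, the counter
layout is simplified, and there is no locking. It only shows the idea of
trusting the cheap value when it is far from the limit and taking the
exact sum when "value - amount" drops below the batch size.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NCPUS 4
#define MODEL_BATCH 64	/* stands in for FBC_BATCH (assumed value) */

/* Simplified, single-threaded stand-in for struct percpu_counter. */
struct model_counter {
	int64_t count;			/* cheap, possibly stale value */
	int64_t local[MODEL_NCPUS];	/* per-CPU deltas not yet folded in */
};

/* Exact value: the cheap count plus every pending per-CPU delta. */
static int64_t model_sum(const struct model_counter *c)
{
	int64_t sum = c->count;

	for (int i = 0; i < MODEL_NCPUS; i++)
		sum += c->local[i];
	return sum;
}

/* Fold all pending deltas into count, in the spirit of sum_and_set. */
static int64_t model_sum_and_set(struct model_counter *c)
{
	c->count = model_sum(c);
	for (int i = 0; i < MODEL_NCPUS; i++)
		c->local[i] = 0;
	return c->count;
}

/* Returns 1 if fewer than 'amount' units are free, 0 otherwise. */
static int model_compare_lt(struct model_counter *c, int64_t amount)
{
	assert(amount >= 0);
	/* Too close to the limit to trust the cheap value: resync first. */
	if (c->count - amount < MODEL_BATCH)
		model_sum_and_set(c);
	return c->count < amount;
}
```

For instance, with count = 0 and four pending per-CPU deltas of 25 each,
model_compare_lt(&c, 1) sees 0 - 1 = -1 < 64, resyncs count to 100, and
reports that the request fits.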

> percpu_counter_sum_and_set
> That means ext4_has_free_blocks return success
> 
> Now while claiming blocks we do
> __percpu_counter_add(fbc, 30, 64)
> 
> here  30  < 64. That means we don't do fbc->count += count.
> so fbc->count remains as 100 and we have 4  cpu successfully
> allocating 30 blocks which means we have to satisfy 120 blocks.
> 

The situation you described could happen, but it is really rare and
should only happen when the fs is really full. The total number of
global free blocks has to be less than the total number of CPUs, with
multiple threads writing/allocating on each CPU.
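
To make the quoted 100-blocks/four-CPUs case concrete, here is a small
sequential model of it. This is only a sketch under stated assumptions:
struct sim_counter and the sim_* helpers are hypothetical, the batched
add is modelled on the __percpu_counter_add(fbc, amount, batch) call
quoted above, and CPUs run one after another, so it shows the stale
count effect but not a true concurrent race.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_NCPUS 4
#define SIM_BATCH 64	/* batch size used in the quoted example */

struct sim_counter {
	int64_t count;			/* cheap global value */
	int64_t local[SIM_NCPUS];	/* per-CPU deltas below the batch size */
};

/* Batched add: the delta stays per-CPU until it reaches the batch size. */
static void sim_add(struct sim_counter *c, int cpu, int64_t amount)
{
	assert(cpu >= 0 && cpu < SIM_NCPUS);
	int64_t v = c->local[cpu] + amount;

	if (v >= SIM_BATCH || v <= -SIM_BATCH) {
		c->count += v;
		c->local[cpu] = 0;
	} else {
		c->local[cpu] = v;
	}
}

/* Exact value: global count plus all pending per-CPU deltas. */
static int64_t sim_exact(const struct sim_counter *c)
{
	int64_t sum = c->count;

	for (int i = 0; i < SIM_NCPUS; i++)
		sum += c->local[i];
	return sum;
}

/* Fold pending deltas into the global count. */
static void sim_resync(struct sim_counter *c)
{
	c->count = sim_exact(c);
	for (int i = 0; i < SIM_NCPUS; i++)
		c->local[i] = 0;
}

/*
 * Each CPU in turn checks the cheap count (resyncing only when it is
 * within SIM_BATCH of the request) and then reserves 'nblocks'.
 * Returns how many CPUs had their reservation accepted.
 */
static int sim_reserve_on_all_cpus(struct sim_counter *c, int64_t nblocks)
{
	int accepted = 0;

	for (int cpu = 0; cpu < SIM_NCPUS; cpu++) {
		if (c->count - nblocks < SIM_BATCH)
			sim_resync(c);
		if (c->count < nblocks)
			continue;	/* would be -ENOSPC */
		sim_add(c, cpu, -nblocks);
		accepted++;
	}
	return accepted;
}
```

Starting from count = 100 with nblocks = 30, all four CPUs are accepted,
count still reads 100, and the exact sum is -20. Starting from count = 3
with nblocks = 1, every check resyncs, so only three reservations are
accepted and the exact sum ends at 0.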

Mingming
> -aneesh
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
