linux-ext4 - Re: [patch 42/51] ext2: fix data corruption for racing writes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <604427e00904140725p5451e969j8446d95ef89ed9e2@mail.gmail.com>
Date:	Tue, 14 Apr 2009 07:25:31 -0700
From:	Ying Han <yinghan@...gle.com>
To:	Jan Kara <jack@...e.cz>
Cc:	Nick Piggin <nickpiggin@...oo.com.au>, akpm@...ux-foundation.org,
	torvalds@...ux-foundation.org, cmm@...ibm.com,
	linux-ext4@...r.kernel.org
Subject: Re: [patch 42/51] ext2: fix data corruption for racing writes

On Tue, Apr 14, 2009 at 6:13 AM, Jan Kara <jack@...e.cz> wrote:
> On Tue 14-04-09 23:02:02, Nick Piggin wrote:
>> Thanks for the fix. I meant to ask but forgot: I guess you have checked
>> that no other ext/minix derived filesystem has similar problems? I did
>> attempt to reproduce with ext3 but couldn't. I don't know enough of
>> the code to say whether it does not exist though.
>  I've checked that ext3 does the right thing (in fact I've made ext2 do
> a similar thing as ext3 does) and ext4 is fine as well. I've looked at minix
> now and it seems to be correct as well. It's curious how the bug got into
> ext2...
I haven't been able to reproduce on ext4 neither.

--Ying
>
>                                                                Honza
>
>> On Tuesday 14 April 2009 07:40:14 akpm@...ux-foundation.org wrote:
>> > From: Jan Kara <jack@...e.cz>
>> >
>> > If two writers allocating blocks to file race with each other (e.g.
>> > because writepages races with ordinary write or two writepages race with
>> > each other), ext2_getblock() can be called on the same inode in parallel.
>> > Before we are going to allocate new blocks, we have to recheck the block
>> > chain we have obtained so far without holding truncate_mutex.  Otherwise
>> > we could overwrite the indirect block pointer set by the other writer
>> > leading to data loss.
>> >
>> > The below test program by Ying is able to reproduce the data loss with ext2
>> > on in BRD in a few minutes if the machine is under memory pressure:
>> >
>> > long kMemSize  = 50 << 20;
>> > int kPageSize = 4096;
>> >
>> > int main(int argc, char **argv) {
>> >     int status;
>> >     int count = 0;
>> >     int i;
>> >     char *fname = "/mnt/test.mmap";
>> >     char *mem;
>> >     unlink(fname);
>> >     int fd = open(fname, O_CREAT | O_EXCL | O_RDWR, 0600);
>> >     status = ftruncate(fd, kMemSize);
>> >     mem = mmap(0, kMemSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>> >     // Fill the memory with 1s.
>> >     memset(mem, 1, kMemSize);
>> >     sleep(2);
>> >     for (i = 0; i < kMemSize; i++) {
>> >             int byte_good = mem[i] != 0;
>> >             if (!byte_good && ((i % kPageSize) == 0)) {
>> >                     //printf("%d ", i / kPageSize);
>> >                     count++;
>> >             }
>> >     }
>> >     munmap(mem, kMemSize);
>> >     close(fd);
>> >     unlink(fname);
>> >
>> >     if (count > 0) {
>> >             printf("Running %d bad page\n", count);
>> >             return 1;
>> >     }
>> >     return 0;
>> > }
>> >
>> > Cc: Ying Han <yinghan@...gle.com>
>> > Cc: Nick Piggin <nickpiggin@...oo.com.au>
>> > Signed-off-by: Jan Kara <jack@...e.cz>
>> > Cc: Mingming Cao <cmm@...ibm.com>
>> > Cc: <linux-ext4@...r.kernel.org>
>> > Signed-off-by: Andrew Morton <akpm@...ux-foundation.org>
>> > ---
>> >
>> >  fs/ext2/inode.c |   44 +++++++++++++++++++++++++++++++++-----------
>> >  1 file changed, 33 insertions(+), 11 deletions(-)
>> >
>> > diff -puN fs/ext2/inode.c~ext2-fix-data-corruption-for-racing-writes fs/ext2/inode.c
>> > --- a/fs/ext2/inode.c~ext2-fix-data-corruption-for-racing-writes
>> > +++ a/fs/ext2/inode.c
>> > @@ -590,9 +590,8 @@ static int ext2_get_blocks(struct inode
>> >
>> >     if (depth == 0)
>> >             return (err);
>> > -reread:
>> > -   partial = ext2_get_branch(inode, depth, offsets, chain, &err);
>> >
>> > +   partial = ext2_get_branch(inode, depth, offsets, chain, &err);
>> >     /* Simplest case - block found, no allocation needed */
>> >     if (!partial) {
>> >             first_block = le32_to_cpu(chain[depth - 1].key);
>> > @@ -602,15 +601,16 @@ reread:
>> >             while (count < maxblocks && count <= blocks_to_boundary) {
>> >                     ext2_fsblk_t blk;
>> >
>> > -                   if (!verify_chain(chain, partial)) {
>> > +                   if (!verify_chain(chain, chain + depth - 1)) {
>> >                             /*
>> >                              * Indirect block might be removed by
>> >                              * truncate while we were reading it.
>> >                              * Handling of that case: forget what we've
>> >                              * got now, go to reread.
>> >                              */
>> > +                           err = -EAGAIN;
>> >                             count = 0;
>> > -                           goto changed;
>> > +                           break;
>> >                     }
>> >                     blk = le32_to_cpu(*(chain[depth-1].p + count));
>> >                     if (blk == first_block + count)
>> > @@ -618,7 +618,8 @@ reread:
>> >                     else
>> >                             break;
>> >             }
>> > -           goto got_it;
>> > +           if (err != -EAGAIN)
>> > +                   goto got_it;
>> >     }
>> >
>> >     /* Next simple case - plain lookup or failed read of indirect block */
>> > @@ -626,6 +627,33 @@ reread:
>> >             goto cleanup;
>> >
>> >     mutex_lock(&ei->truncate_mutex);
>> > +   /*
>> > +    * If the indirect block is missing while we are reading
>> > +    * the chain(ext3_get_branch() returns -EAGAIN err), or
>> > +    * if the chain has been changed after we grab the semaphore,
>> > +    * (either because another process truncated this branch, or
>> > +    * another get_block allocated this branch) re-grab the chain to see if
>> > +    * the request block has been allocated or not.
>> > +    *
>> > +    * Since we already block the truncate/other get_block
>> > +    * at this point, we will have the current copy of the chain when we
>> > +    * splice the branch into the tree.
>> > +    */
>> > +   if (err == -EAGAIN || !verify_chain(chain, partial)) {
>> > +           while (partial > chain) {
>> > +                   brelse(partial->bh);
>> > +                   partial--;
>> > +           }
>> > +           partial = ext2_get_branch(inode, depth, offsets, chain, &err);
>> > +           if (!partial) {
>> > +                   count++;
>> > +                   mutex_unlock(&ei->truncate_mutex);
>> > +                   if (err)
>> > +                           goto cleanup;
>> > +                   clear_buffer_new(bh_result);
>> > +                   goto got_it;
>> > +           }
>> > +   }
>> >
>> >     /*
>> >      * Okay, we need to do block allocation.  Lazily initialize the block
>> > @@ -683,12 +711,6 @@ cleanup:
>> >             partial--;
>> >     }
>> >     return err;
>> > -changed:
>> > -   while (partial > chain) {
>> > -           brelse(partial->bh);
>> > -           partial--;
>> > -   }
>> > -   goto reread;
>> >  }
>> >
>> >  int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_result, int create)
>> > _
>> >
>>
> --
> Jan Kara <jack@...e.cz>
> SUSE Labs, CR
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html