linux-ext4 - OpenPosix test case mmap_11-4 fails in ext4 filesystem

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <537C4A2F.8040402@cn.fujitsu.com>
Date:	Wed, 21 May 2014 14:39:43 +0800
From:	Xiaoguang Wang <wangxg.fnst@...fujitsu.com>
To:	<linux-ext4@...r.kernel.org>
CC:	<tytso@....edu>, <chrubis@...e.cz>
Subject: OpenPosix test case mmap_11-4 fails in ext4 filesystem

Hi,

Recently I met the mmap_11-4 fails when running LTP in RHEL7.0RC. Attached is a
test program to reproduce this problem, which is written by Cyril.  Uncommenting the
msync() makes the test succeed in old linux distribution, such as RHEL6.5GA, but
fails in RHEL7.0RC.

I also read  some ext4's source code in RHEL7.0RC and here is the possible reason
according to my understanding. Hope this will help you something.
--------------------------------------------------------------------------------------------

When calling msync() in an ext4 file system, ext4_bio_write_page will be
called to write back dirty pages. Here is the source code in RHEL7.0RC:

int ext4_bio_write_page(struct ext4_io_submit *io, struct page *page, int len, struct writeback_control *wbc)
 {
         struct inode *inode = page->mapping->host;
         unsigned block_start, blocksize;
         struct buffer_head *bh, *head;
         int ret = 0;
         int nr_submitted = 0;
 
         blocksize = 1 << inode->i_blkbits;
 
         BUG_ON(!PageLocked(page));
         BUG_ON(PageWriteback(page));
 
         set_page_writeback(page);
         ClearPageError(page);
 
         ......

         bh = head = page_buffers(page);
         do {
                 block_start = bh_offset(bh);
                 if (block_start >= len) {
                         /*
                          * Comments copied from block_write_full_page_endio:
                          *
                          * The page straddles i_size.  It must be zeroed out on
                          * each and every writepage invocation because it may
                          * be mmapped.  "A file is mapped in multiples of the
                          * page size.  For a file that is not a multiple of
                          * the  page size, the remaining memory is zeroed when
                          * mapped, and writes to that region are not written
                          * out to the file."
                          */
                         zero_user_segment(page, block_start,
                                           block_start + blocksize);
                         clear_buffer_dirty(bh);
                         set_buffer_uptodate(bh);
                         continue;
                 }
                 ......
         } while ((bh = bh->b_this_page) != head);
--------------------------------------------------------------------------------------------
I deleted some irrelevant code.

The argument len is computed by the following code:
loff_t size = i_size_read(inode);  // file's length
if (index == size >> PAGE_CACHE_SHIFT)
        len = size & ~PAGE_CACHE_MASK;
else
        len = PAGE_CACHE_SIZE;

That means len is the valid file length in every page.

When ext4 file system's block size is 1024, then there will be 4 struct buffer head attached to this page.

See the above "do... while ..." statements in ext4_bio_write_page(), "block_start = bh_offset(bh);" will make
block_start be 0 for the first buffer head, 1024 for the second, 2048 for the third, 3072 for the forth.

So in the reproduce program, in this case, len is 2048,  the  "if (block_start >= len) "
condition will be satisfied in the third and forth iteration, so "zero_user_segment(page, block_start, block_start + blocksize);" will
be called, then the content beyond the file's end will be zeroed, so the reproduce program will succeed.

But when ext4 file system's block size if 4096, then there will only on buffer head attached to
this page, then when len is 2048,  "while ((bh = bh->b_this_page) != head);" statement  will make the "do ... while..."
statement execute only once. In the first iteration, "block_start = bh_offset(bh); " will make
block_start be 0, " if (block_start >= len) "  won't be satisfied,  zero_user_segment() won't be called,
so the content in current page  beyond the file's end will not be zeroed, so the reproduce program fails.

In RHEL6.5GA, block_write_full_page() will be called to do work similar to ext4_bio_write_page, this function does
not do the zero work in unit of struct buffer head, so this bug is not exist.

The above is my understanding. If it's not correct, I'd like that you explain the true reason, thanks.

Also I don't know whether this can be considered an ext4 bug or not. And according msync()'s definition,
it seems that it dose not require msync() to zero the partial page beyond mmaped file's area. But at least, msync()'s behavior will be
different when ext4 file system has different block size, I think we may fix these to keep consistency.

Or you think ext4's implementation is correct and the LTP mmap_11-4 case is invalid? Thanks.

Regards,
Xiaoguang Wang

View attachment "test.c" of type "text/x-csrc" (2045 bytes)