lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date:	Mon,  6 Apr 2015 13:31:41 -0600
From:	Jun He <jhe@...wisc.edu>
To:	tytso@....edu
Cc:	linux-ext4@...r.kernel.org, linux-kernel@...r.kernel.org,
	Jun He <jhe@...wisc.edu>
Subject: [PATCH] ext4 mballoc: fix tail allocation 

This patch addresses the tail allocation problem found at paper "Reducing File System Tail Latencies with Chopper". https://www.usenix.org/system/files/conference/fast15/fast15-paper-he.pdf . The paper refers the tail allocation problem as the Special End problem.

Here is a description of the problem:

A tail extent is the last extent of a file. The last block of the tail extent corresponds to the last logical block of the file. When a file is closed and the tail extent is being allocated, ext4 marks the allocation request with the hint EXT4_MB_HINT_NOPREALLOC. The intention is to avoid preallocation when we know the file is final (it won't change until the next open.). But the implementation leads to some problems. The following program attacks the problem.

/****************************************/
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int fd;
    int size, off;
    char buf[4096];

    size = 4096;

    fd = open(argv[1], O_WRONLY|O_CREAT);
    if ( fd == -1 ) {
        perror("opening file");
        exit(1);
    }
    
    off = 0;
    pwrite(fd, buf, size, off);
    printf("wrote at %d, size %d bytes\n", off, size);
    fsync(fd);

    off = off + size;
    pwrite(fd, buf, size, off);
    printf("wrote at %d, size %d bytes\n", off, size);

    close(fd);
    sync();

}
/****************************************/

Mount an ext4 on /mnt/ext4onloop and run the program by 
$ ./a.out /mnt/ext4onloop/testfile

Check the extents by
$ filefrag -sv /mnt/ext4onloop/testfile
Filesystem type is: ef53
File size of /mnt/ext4onloop/testfile is 8192 (2 blocks, blocksize 4096)
 ext logical physical expected length flags
   0       0    33280               1 
   1       1    33025    33281      1 eof
/mnt/ext4onloop/testfile: 2 extents found

The first 4KB of the file is forced to disk by fsync() and allocated from a locality group preallocation. The second 4KB of the file is forced to disk by sync() (If you don't call sync(), the effect would be the same as the write back thread will eventually kick it and force the data to disk.). Since the second 4KB is allocated when the file is closed and the extent is the tail extent, it is allocated with the hint EXT4_MB_HINT_NOPREALLOC. This hint prevents the second 4KB from being allocated from locality group preallocation. So the second 4KB is not allocated next to the first 4KB.  

The program above demonstrates the tail allocation problem. The Chopper paper demonstrates that the problem could drag extents of the same file very far (GBs) away from each other. 

The EXT4_MB_HINT_NOPREALLOC hint currently means: 
1. if there is an i-node preallocation for this file, use it; 
2. do not use LG (locality group) preallocations, even they are available (I think this is not intended); 
3. do not create any new preallocations.

EXT4_MB_HINT_NOPREALLOC is not the right hint for tail allocation. What we really want is:
1. if file is large and there is an i-node preallocation for this file, use it; 
2. if file is large and there is no i-node preallocation for this file, do not create new inode preallocation. Simply allocate as needed.
3. if file is small and there is LG preallocations, use it. 
4. if file is small and there is no LG preallocations, create a LG preallocation and allocate the tail extent from it. LG preallocations are always useful because small files almost always come. In the case that a write back thread writes a lot of new small files, this design will group the small files together and reduce seeks. Also, allocating many small files from LG preallocation space should be faster than allocating them from buddy cache separately. 

To get the correct behaviors:

EXT4_MB_HINT_NOPREALLOC is not the right hint for special end. If we remove current check for tail in ext4_mb_group_or_file() , 1, 3, 4 will be satisfied. Now, we only need to handle the tail of large file (2) by checking tail when normalizing. If it is a tail of large file, we simply do not normalize it. 

The patch with Linux 4.0rc6 has been tested with kvm-xfstests. generic/256 failed. But I wouldn't worry about it because the vanilla kernel also failed...


Signed-off-by: Jun He <jhe@...wisc.edu>

---
 fs/ext4/mballoc.c | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 8d1e602..49c8547 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3004,7 +3004,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	int bsbits, max;
 	ext4_lblk_t end;
-	loff_t size, start_off;
+	loff_t size, isize, start_off;
 	loff_t orig_size __maybe_unused;
 	ext4_lblk_t start;
 	struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
@@ -3019,8 +3019,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 	if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
 		return;
 
-	/* caller may indicate that preallocation isn't
-	 * required (it's a tail, for example) */
+	/* caller may indicate that preallocation isn't required */
 	if (ac->ac_flags & EXT4_MB_HINT_NOPREALLOC)
 		return;
 
@@ -3029,11 +3028,21 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 		return ;
 	}
 
-	bsbits = ac->ac_sb->s_blocksize_bits;
-
 	/* first, let's learn actual file size
 	 * given current request is allocated */
 	size = ac->ac_o_ex.fe_logical + EXT4_C2B(sbi, ac->ac_o_ex.fe_len);
+
+	bsbits = ac->ac_sb->s_blocksize_bits;
+	isize = (i_size_read(ac->ac_inode) + ac->ac_sb->s_blocksize - 1)
+		>> bsbits;
+
+	/* don't normalize tail of large files */
+	if ((size == isize) &&
+	    !ext4_fs_is_busy(sbi) &&
+	    (atomic_read(&ac->ac_inode->i_writecount) == 0)) {
+		return;
+	}
+
 	size = size << bsbits;
 	if (size < i_size_read(ac->ac_inode))
 		size = i_size_read(ac->ac_inode);
@@ -4107,13 +4116,6 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
 	isize = (i_size_read(ac->ac_inode) + ac->ac_sb->s_blocksize - 1)
 		>> bsbits;
 
-	if ((size == isize) &&
-	    !ext4_fs_is_busy(sbi) &&
-	    (atomic_read(&ac->ac_inode->i_writecount) == 0)) {
-		ac->ac_flags |= EXT4_MB_HINT_NOPREALLOC;
-		return;
-	}
-
 	if (sbi->s_mb_group_prealloc <= 0) {
 		ac->ac_flags |= EXT4_MB_STREAM_ALLOC;
 		return;
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Powered by blists - more mailing lists