lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20120801192934.GB19139@mcmilk.de>
Date:	Wed, 1 Aug 2012 21:29:35 +0200
From:	Tino Reichardt <list-linux-fsdevel@...ilk.de>
To:	jfs-discussion@...ts.sourceforge.net, linux-kernel@...r.kernel.org,
	linux-fsdevel@...r.kernel.org
Subject: Re: [Jfs-discussion] [PATCH] fs/jfs: TRIM support for JFS
 Filesystem

* Dave Kleikamp <dave.kleikamp@...cle.com> wrote:
> On 07/28/2012 06:08 AM, Tino Reichardt wrote:
> > * Tino Reichardt <list-linux-fsdevel@...ilk.de> wrote:
> >> > This patch adds support for the two linux interfaces of the discard/TRIM
> >> > command for SSD devices and sparse/thinly-provisioned LUNs.
> > Fixed a problem when setting minlen in jfs_ioc_trim().
> 
> Overall, I approve of this. checkpatch reports some formatting problems:
> whitespace, long lines, etc. You could probably ignore the warning
> changing printk to pr_err or pr_info, since none of the other JFS code
> does that.

I removed all errors exept the call to simple_strtoull(). Also the
printk() was replaced in super.c with pr_err(). The other files of jfs
could also easily changed to the pr_*() calls... I can make another
patch for it, if wanted :)


> Other comments are inline below. Pending additional input, I can push a
> cleaned-up patch upstream.

That would be nice.

> > 
> > Signed-off-by: Tino Reichardt <list-jfs@...ilk.de>
> Acked-by: Dave Kleikamp <dave.kleikamp@...cle.com>

Signed-off-by: Tino Reichardt <list-jfs@...ilk.de>


> 
> > 
> >  ...
> > 
> >  
> >  errors=continue		Keep going on a filesystem error.
> > -errors=remount-ro	Default. Remount the filesystem read-only on an error.
> > +errors=remount-ro(*)	Default. Remount the filesystem read-only on an error.
> >  errors=panic		Panic and halt the machine if an error occurs.
> >  
> >  uid=value	Override on-disk uid with specified value
> 
> I don't mind the (*) indicating the defaults. This would be consistent
> with some of the other files in this directory. But we should then
> remove the word "Default", which becomes redundant.

Is done.

> 
> > @@ -35,6 +36,18 @@ umask=value	Override on-disk umask with
> >  		directories, the execute bit will be set if the corresponding
> >  		read bit is set.
> >  
> > +discard=minlen	This enables/disables the use of discard/TRIM commands.
> > +discard		The discard/TRIM commands are sent to the underlying
> > +nodiscard(*)	block device when blocks are freed. This is useful for SSD
> > +		devices and sparse/thinly-provisioned LUNs.  The FITRIM ioctl
> > +		command is also available together with the nodiscard option.
> > +		The value of minlen specifies the minimum blockcount, when
> > +		a TRIM command to the block device is considered usefull.
> > +		When no value is given to the discard option, it defaults to
> > +		64 blocks, which means 256KiB in JFS.
> > +		The minlen value of discard overrides the minlen value given
> > +		on an FITRIM ioctl().
> > +
> >  Please send bugs, comments, cards and letters to shaggy@...ux.vnet.ibm.com.
> 
> Oops, not a problem with your patch. I still have my old email address
> here. I need to remove that.

ACK :)

> 
> > diff -X exclude -urpN linux-git/fs/jfs/jfs_discard.c linux_jfs-trim/fs/jfs/jfs_discard.c
> > --- linux-git/fs/jfs/jfs_discard.c	1970-01-01 01:00:00.000000000 +0100
> > +++ linux_jfs-trim/fs/jfs/jfs_discard.c	2012-07-26 22:53:48.640979880 +0200
> > @@ -0,0 +1,119 @@
> > +/*
> > + *   Copyright (C) Tino Reichardt, 2012
> > + *
> > + *   This program is free software;  you can redistribute it and/or modify
> > + *   it under the terms of the GNU General Public License as published by
> > + *   the Free Software Foundation; either version 2 of the License, or
> > + *   (at your option) any later version.
> > + *
> > + *   This program is distributed in the hope that it will be useful,
> > + *   but WITHOUT ANY WARRANTY;  without even the implied warranty of
> > + *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See
> > + *   the GNU General Public License for more details.
> > + *
> > + *   You should have received a copy of the GNU General Public License
> > + *   along with this program;  if not, write to the Free Software
> > + *   Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> > + */
> > +
> > +#include <linux/fs.h>
> > +#include <linux/slab.h>
> > +#include <linux/blkdev.h>
> > +
> > +#include "jfs_incore.h"
> > +#include "jfs_superblock.h"
> > +#include "jfs_dmap.h"
> > +#include "jfs_lock.h"
> > +#include "jfs_debug.h"
> 
> Should include "jfs_discard.h" to ensure that the declarations and
> definitions of these functions stay in sync. That would also clean up a
> sparse warning.

Done.
> 
> > +
> > +
> > +/*
> > + * NAME:	jfs_issue_discard()
> > + *
> > + * FUNCTION:	TRIM the specified block range on device, if supported
> > + *
> > + * PARAMETERS:
> > + *	ip	- pointer to in-core inode
> > + *	blkno	- starting block number to be trimmed (0..N)
> > + *	nblocks	- number of blocks to be trimmed
> > + *
> > + * RETURN VALUES:
> > + *	none
> > + *
> > + * serialization: IREAD_LOCK(ipbmap) held on entry/exit;
> > + */
> > +void jfs_issue_discard(struct inode *ip, u64 blkno, u64 nblocks)
> > +{
> > +	struct super_block *sb = ip->i_sb;
> > +	int r = 0;
> > +
> > +	r = sb_issue_discard(sb, blkno, nblocks, GFP_NOFS, 0);
> > +	if (unlikely(r != 0)) {
> > +		printk(KERN_ERR "JFS: sb_issue_discard"
> > +			"(%p, %llu, %llu, GFP_NOFS, 0) = %d => failure!\n", sb,
> > +			(unsigned long long)blkno,
> > +			(unsigned long long)nblocks, r);
> > +	}
> > +
> > +#ifdef JFS_DEBUG_TRIM
> > +	printk(KERN_INFO "JFS: sb_issue_discard"
> > +		"(%p, %llu, %llu, GFP_NOFS, 0) = %d\n", sb,
> > +		(unsigned long long)blkno,
> > +		(unsigned long long)nblocks, r);
> > +#endif
> 
> Okay for now, but probably should drop the debug stuff before merging
> into mainline.

Done, all debugging is removed now. Just one jfs_info() is left, like
other different functions use it.

> >  
> >  /*
> > + * NAME:	dbDiscardAG()
> > + *
> > + * FUNCTION:	attempt to discard (TRIM) all free blocks of specific AG
> > + *
> > + * 		algorithm:
> > + * 		1) allocate blocks, as large as possible and save them
> 		   while holding IWRITE_LOCK on ipbmap

Done, thaks for correction.

> > + * 		2) trim all these saved block/length values
> > + * 		3) mark the blocks free again
> > + *
> > + * 		benefit:
> > + * 		- we work only on one ag at some time, which is fully blocked
> 		maybe say "we work only on one ag at some time,
> 			   minimizing how long we need to lock ipbmap"
> 		- no lock is held while discarding the blocks

Done, thaks for correction.

> > + * 		- reading / writing the fs is possible most time, even on trimming
> > + *
> > + * 		downside:
> > + * 		- we write two times to the dmapctl and dmap pages
> > + * 		- but for me, this seems the best way, better ideas?
> > + * 		/TR 2012
> > + *
> > + * PARAMETERS:
> > + *	ip	- pointer to in-core inode
> > + *	agno	- ag to trim
> > + *	minlen	- minimum value of contiguous blocks
> > + *
> > + * RETURN VALUES:
> > + *	s64	- actual number of blocks trimmed
> > + */
> > +s64 dbDiscardAG(struct inode *ip, int agno, s64 minlen)
> > +{
> > +	struct inode *ipbmap = JFS_SBI(ip->i_sb)->ipbmap;
> > +	struct bmap *bmp = JFS_SBI(ip->i_sb)->bmap;
> > +	s64 nblocks, blkno;
> > +	u64 trimmed = 0;
> > +	int rc, l2nb;
> > +	struct super_block *sb = ipbmap->i_sb;
> > +
> > +	struct range2trim {
> > +		u64 blkno;
> > +		u64 nblocks;
> > +	} *totrim, *tt;
> > +
> > +	/* max blkno / nblocks pairs to trim */
> > +	int count = 0, range_cnt = 32 * 1024;
> > +
> > +	/* prevent others from writing new stuff here, while trimming */
> > +	IWRITE_LOCK(ipbmap, RDWRLOCK_DMAP);
> > +
> > +	/* worst value: each free block gets an entry */
> This comment is no longer accurate or necessary, as you only allocate
> range_cnt elements. It could be limited to min(range_cnt,
> nblocks/minlen+1), but that's probably not too large an allocation for
> modern hardware.

Yes, I also included now your suggestion with min(range_cnt, ...)

> 
> > +	nblocks = bmp->db_agfree[agno];
> > +	totrim = kmalloc(sizeof(struct range2trim) * range_cnt, GFP_NOFS);
> > +	if (totrim == NULL) {
> > +		jfs_error(bmp->db_ipbmap->i_sb,
> > +			  "dbDiscardAG: no space for trim array");
> > +		IWRITE_UNLOCK(ipbmap);
> > +		return 0;
> > +	}
> > +
> > +	tt = totrim;
> > +	while (nblocks >= minlen) {
> > +		l2nb = BLKSTOL2(nblocks);
> > +
> > +		/* 0 = okay, -EIO = fatal, -ENOSPC -> block kleiner */
> 
> English comments please. :-) Maybe "try smaller block"

Hm, Ja habe ich angepasst :)

> 
> > +		rc = dbAllocAG(bmp, agno, nblocks, l2nb, &blkno);
> > +		if (rc == 0) {
> > +			tt->blkno = blkno;
> > +			tt->nblocks = nblocks;
> > +			tt++; count++;
> > +
> > +#ifdef JFS_DEBUG_TRIM
> > +		printk(KERN_INFO "JFS: agno=%d/%d, blkno:%ld, nblocks=%ld\n",
> > +			agno+1, bmp->db_numag, (long int)blkno,
> > +			(long int)nblocks);
> > +#endif
> > +
> > +			/* the whole ag is free, trim now */
> > +			if (bmp->db_agfree[agno] == 0)
> > +				break;
> > +
> > +			/* give a hint for the next while */
> > +			nblocks = bmp->db_agfree[agno];
> > +			continue;
> > +		} else if (rc == -ENOSPC) {
> > +			/* search for next smaller log2 block */
> > +			l2nb = BLKSTOL2(nblocks) - 1;
> > +			nblocks = 1 << l2nb;
> > +		} else {
> > +			/* Trim any already-allocated blocks */
> > +			printk(KERN_ERR "JFS: dbDiscardAG: -EIO\n");
> > +			break;
> > +		}
> > +
> > +		/* check, if our trim array is full */
> > +		if (unlikely(count >= range_cnt - 1))
> > +			break;
> > +	}
> > +	IWRITE_UNLOCK(ipbmap);
> > +
> > +	tt->nblocks = 0; /* mark the current end */
> > +	for (tt = totrim; tt->nblocks != 0; tt++) {
> > +		if (!(JFS_SBI(sb)->flag & JFS_DISCARD)) {
> > +			/* not needed, when online discard is used */
> 
> Why enter the function at all if JFS_DISCARD is set? But is this really
> true? Removing files or file fragments that are smaller than
> minblks_trim will fail to discard them dynamically.

The other FS can also trim via fstrim(8) when mounted with discard
option :) It is important, that a user can discard all free blocks, even
when mounting with discard option. The FS could also be mounted several
times without discard option, and then there are some ranges, where the
device isn't informed about these ranges. So the batched discard ioctl()
is then the only way to change that.


The comment there was also a bit updated, here is it:

/* when mounted with online discard, dbFree() will
 * call jfs_issue_discard() itself */


Also the minblks_trim value changes both:
1) the trim via ioctl()
2) the online discard

With small values (1..8) of online discarding, JFS will get really slow,
which is not my aim ;)

On my notebook, I will mount JFS without the discard mount option and
call the batched discard via fstrim(8) sometimes... this seems for my
usage better. But big SAN devices or other users my find online
discarding better... maybe I will use discard=64, this would only trim
blocks of 256KiB or bigger... which is okay. All the smaller values
would just make everything slow.

Maybe, the other filesystems change their online discard option also.
Other notions about that?


patches are also located here:
http://www.mcmilk.de/projects/jfs-trim/linux-tree/


-- 
regards, TR

View attachment "jfs-trim-2012-08-01.diff" of type "text/plain" (21058 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ