linux-kernel - Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87f94c370908131719l7d84c5d0x2157cfeeb2451bce@mail.gmail.com>
Date:	Thu, 13 Aug 2009 20:19:57 -0400
From:	Greg Freemyer <greg.freemyer@...il.com>
To:	Richard Sharpe <realrichardsharpe@...il.com>
Cc:	david@...g.hm, Markus Trippelsdorf <markus@...ppelsdorf.de>,
	Matthew Wilcox <willy@...ux.intel.com>,
	Hugh Dickins <hugh.dickins@...cali.co.uk>,
	Nitin Gupta <ngupta@...are.org>, Ingo Molnar <mingo@...e.hu>,
	Peter Zijlstra <peterz@...radead.org>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	linux-scsi@...r.kernel.org, linux-ide@...r.kernel.org,
	Linux RAID <linux-raid@...r.kernel.org>
Subject: Re: Discard support (was Re: [PATCH] swap: send callback when swap 
	slot is freed)

On Thu, Aug 13, 2009 at 6:20 PM, Richard
Sharpe<realrichardsharpe@...il.com> wrote:
> On Thu, Aug 13, 2009 at 2:28 PM, Greg Freemyer<greg.freemyer@...il.com> wrote:
>> On Thu, Aug 13, 2009 at 4:44 PM, <david@...g.hm> wrote:
>>> On Thu, 13 Aug 2009, Greg Freemyer wrote:
>>>
>>>> On Thu, Aug 13, 2009 at 12:33 PM, <david@...g.hm> wrote:
>>>>>
>>>>> On Thu, 13 Aug 2009, Markus Trippelsdorf wrote:
>>>>>
>>>>>> On Thu, Aug 13, 2009 at 08:13:12AM -0700, Matthew Wilcox wrote:
>>>>>>>
>>>>>>> I am planning a complete overhaul of the discard work.  Users can send
>>>>>>> down discard requests as frequently as they like.  The block layer will
>>>>>>> cache them, and invalidate them if writes come through.  Periodically,
>>>>>>> the block layer will send down a TRIM or an UNMAP (depending on the
>>>>>>> underlying device) and get rid of the blocks that have remained
>>>>>>> unwanted
>>>>>>> in the interim.
>>>>>>
>>>>>> That is a very good idea. I've tested your original TRIM implementation
>>>>>> on
>>>>>> my Vertex yesterday and it was awful ;-). The SSD needs hundreds of
>>>>>> milliseconds to digest a single TRIM command. And since your
>>>>>> implementation
>>>>>> sends a TRIM for each extent of each deleted file, the whole system is
>>>>>> unusable after a short while.
>>>>>> An optimal solution would be to consolidate the discard requests, bundle
>>>>>> them and send them to the drive as infrequent as possible.
>>>>>
>>>>> or queue them up and send them when the drive is idle (you would need to
>>>>> keep track to make sure the space isn't re-used)
>>>>>
>>>>> as an example, if you would consider spinning down a drive you don't hurt
>>>>> performance by sending accumulated trim commands.
>>>>>
>>>>> David Lang
>>>>
>>>> An alternate approach is the block layer maintain its own bitmap of
>>>> used unused sectors / blocks. Unmap commands from the filesystem just
>>>> cause the bitmap to be updated.  No other effect.
>>>
>>> how does the block layer know what blocks are unused by the filesystem?
>>>
>>> or would it be a case of the filesystem generating discard/trim requests to
>>> the block layer so that it can maintain it's bitmap, and then the block
>>> layer generating the requests to the drive below it?
>>>
>>> David Lang
>>
>> Yes, my thought.was that block layer would consume the discard/trim
>> requests from the filesystem in realtime to maintain the bitmap, then
>> at some later point in time when the system has extra resources it
>> would generate the calls down to the lower layers and eventually the
>> drive.
>
> Why should the block layer be forced to maintain something that is
> probably of use for only a limited number of cases? For example, the
> devices I work on already maintain their own mapping of HOST-visible
> LBAs to underlying storage, and I suspect that most such devices do.
> So, you are duplicating something that we already do, and there is no
> way that I am aware of to synchronise the two.
>
> All we really need, I believe is for the UNMAP requests to come down
> to us with writes barriered until we respond, and it is a relatively
> cheap operation, although writes that are already in the cache and
> uncommitted to disk present some issues if an UNMAP request comes down
> for recently written blocks.
>

Richard,

Quoting the original email I saw in this thread:

>
>The unfortunate thing about the TRIM command is that it's not NCQ, so
>all NCQ commands have to finish, then we can send the TRIM command and
>wait for it to finish, then we can send NCQ commands again.
>
>So TRIM isn't free, and there's a better way for the drive to find
>out that the contents of a block no longer matter -- write some new
>data to it.  So if we just swapped a page in, and we're going to swap
>something else back out again soon, just write it to the same location
>instead of to a fresh location.  You've saved a command, and you've
>saved the drive some work, plus you've allowed other users to continue
>accessing the drive in the meantime.
>
>I am planning a complete overhaul of the discard work.  Users can send
>down discard requests as frequently as they like.  The block layer will
>cache them, and invalidate them if writes come through.  Periodically,
>the block layer will send down a TRIM or an UNMAP (depending on the
>underlying device) and get rid of the blocks that have remained unwanted
>in the interim.
>
>Thoughts on that are welcome.
>>

My thought was that a bitmap was a better solution than a cache of
discard commands.

One of the biggest reasons is that a bitmap can coalesce the unused
areas into much larger discard ranges than a queue that will only have
a limited number of discards to coalesce.

And both Enterprise scsi and mdraid are desirous of larger discard ranges.

Greg
-- 
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
Preservation and Forensic processing of Exchange Repositories White Paper -
<http://www.norcrossgroup.com/forms/whitepapers/tng_whitepaper_fpe.html>

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/