Message-Id: <1174351366.13341.739.camel@localhost.localdomain>
Date:	Tue, 20 Mar 2007 01:42:46 +0100
From:	Thomas Gleixner <tglx@...utronix.de>
To:	Matt Mackall <mpm@...enic.com>
Cc:	Josh Boyer <jwboyer@...ux.vnet.ibm.com>,
	Artem Bityutskiy <dedekind@...radead.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Frank Haverkamp <haver@...t.ibm.com>,
	Christoph Hellwig <hch@...radead.org>,
	David Woodhouse <dwmw2@...radead.org>
Subject: Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

On Mon, 2007-03-19 at 17:32 -0500, Matt Mackall wrote:
> > > If a static volume is simply a non-dynamic volume, then device mapper
> > > can do that too. And countless other things. Which is not an aside.
> > > UBI growing to do all the things that device mapper does is exactly
> > > the thing we should be seeking to avoid.
> > 
> > No it can't and device mapper sits on top of block devices. FLASH is no
> > block device. Period.
> 
> Which of the following two properties does it lack?
> 
> - discrete blocks
> - non-sequential access to blocks
> 
> When you do the obvious s/blocks/eraseblocks/, this appears to be
> true.

It appears to be, but it is not. You are enforcing semantics on a device
which it does not have.

> Saying "but I can't do I/O smaller than the blocksize" doesn't change
> this any more than it would for disks.

There is a huge difference. Disk block size is 512 bytes, while FLASH erase
block size is at minimum 16KiB and goes up to 256KiB.

Just do the math:

Write sampling data streams in 2KiB chunks to your uber device mapper on
a 1GiB device with 64KiB erase block size:

Fine grained FLASH aware writes allow 32 chunks in a block without
erasing the block.

Your method erases the block 32 times to write the same amount of data.

Result: You wear out the flash 32 times faster. Cool feature.
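
For the record, that arithmetic spelled out in a few lines of C (a toy
model, obviously: it assumes the naive block device does one erase cycle
per chunk, while the FLASH aware path fills a block before erasing it):

#include <stdio.h>

int main(void)
{
        const unsigned int chunk_size  = 2 * 1024;   /* 2KiB write chunks */
        const unsigned int erase_block = 64 * 1024;  /* 64KiB erase block */
        const unsigned int chunks_per_block = erase_block / chunk_size;

        /* FLASH aware writes: fill the block, erase it once.
         * Naive block device: read-modify-erase-write per chunk. */
        unsigned int flash_aware_erases = 1;
        unsigned int block_emul_erases  = chunks_per_block;

        printf("chunks per erase block:     %u\n", chunks_per_block);
        printf("erases, FLASH aware path:   %u\n", flash_aware_erases);
        printf("erases, naive block device: %u\n", block_emul_erases);
        printf("wear factor:                %ux\n",
               block_emul_erases / flash_aware_erases);
        return 0;
}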

> Saying "but I can do smaller I/O efficiently in some circumstances"
> also doesn't change it.

We can do it under _any_ circumstances and that _does_ change it.
Implementing a clever block device layer on top of UBI is simple and
would provide FLASH page sized I/O, i.e. 2KiB in the above example.
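
Purely as an illustration of what such a block layer would look like
(this is not code from the UBI patch set; the two ubi_leb_* helpers are
placeholders, not the real UBI kernel interface):

/* Illustrative sketch only: map a 2KiB "sector" of a block device onto
 * (logical eraseblock, offset) and let UBI handle physical placement
 * and wear levelling underneath. */
#define SECTOR_SIZE     2048                    /* FLASH page size    */
#define LEB_SIZE        (64 * 1024)             /* logical eraseblock */
#define SECTORS_PER_LEB (LEB_SIZE / SECTOR_SIZE)

/* placeholders for the UBI kernel interface, not its exact signatures */
int ubi_leb_read(int lnum, void *buf, int offset, int len);
int ubi_leb_write(int lnum, const void *buf, int offset, int len);

static int blkdev_read_sector(unsigned long sector, void *buf)
{
        int lnum   = sector / SECTORS_PER_LEB;
        int offset = (sector % SECTORS_PER_LEB) * SECTOR_SIZE;

        return ubi_leb_read(lnum, buf, offset, SECTOR_SIZE);
}

static int blkdev_write_sector(unsigned long sector, const void *buf)
{
        int lnum   = sector / SECTORS_PER_LEB;
        int offset = (sector % SECTORS_PER_LEB) * SECTOR_SIZE;

        /* Page sized writes go straight through: no read-modify-erase
         * cycle, because UBI lets a LEB be filled incrementally. */
        return ubi_leb_write(lnum, buf, offset, SECTOR_SIZE);
}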

> In historical UNIX, some tapes were block devices too. Because they
> supported seek().

I'm impressed. How exactly are "some tapes" comparable to FLASH chips ?

Your next proposal is to throw away MTD-utils and use "mt" instead ?

> > Device mapper can not provide a simple easy to decode scheme for boot
> > loaders. We need to be able to boot out of 512 - 2048 byte of NAND FLASH
> > and be able to find the kernel or second stage boot loader in this
> > unordered device.
> > 
> > And no, fixed addresses do not work. Do you want to implement device
> > mapper into your Initial Bootloader stage ?
> 
> This is exactly the same problem as booting on a desktop PC. But
> somehow LILO manages. My first Linux box had a hell of a lot less disk
> than the platform I bootstrapped (and wrote NAND drivers for) last
> month had in NAND.

No, it is not. You get the absolute sector address of your second stage
and this is a complete no-brainer. The translation is done in the DISK
device.

You simply ignore the fact that inside each disk, USB stick, CF card,
whatever, there is a more or less intelligent controller which does the
mapping to the physical storage location. There is _NO_ such thing on a
bare FLASH chip.

It does not matter whether your embedded device has more NAND space
than my old CP/M machine's floppy. What matters is that even the old
CP/M floppy drive had some rudimentary intelligence on board.

Furthermore I want to get the bit-flip correction on my second stage
loader / kernel in the same safe way as we do it for everything else,
and still be able to bootstrap that from an extremely small bootloader.
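
To make that concrete, a rough sketch of what the boot loader side could
look like: scan the eraseblocks, pick the ones carrying a marker and
reassemble them in logical order. The header layout and names here are
invented for the example; this is NOT UBI's on-flash format:

#include <stdint.h>
#include <string.h>

#define BOOT_MAGIC      0x55424921u     /* made-up marker, not UBI's */
#define BLOCK_SIZE      (64 * 1024)     /* physical eraseblock size  */

struct boot_hdr {
        uint32_t magic;         /* marks an eraseblock of the boot image */
        uint32_t lnum;          /* logical position of this block        */
};

/* provided by the (hypothetical) minimal NAND read routine of the first
 * stage loader; returns non-zero for bad/unreadable blocks */
int nand_read_block(int pnum, void *buf, unsigned int len);

static void load_second_stage(uint8_t *dst, int nr_pebs)
{
        static uint8_t buf[BLOCK_SIZE];
        const struct boot_hdr *hdr = (const struct boot_hdr *)buf;
        const unsigned int payload = BLOCK_SIZE - sizeof(*hdr);
        int pnum;

        /* Physical order is irrelevant: walk every eraseblock, pick out
         * the ones carrying the boot magic and place their payload at
         * the logical position in RAM. */
        for (pnum = 0; pnum < nr_pebs; pnum++) {
                if (nand_read_block(pnum, buf, BLOCK_SIZE))
                        continue;       /* bad or unreadable block */
                if (hdr->magic != BOOT_MAGIC)
                        continue;
                memcpy(dst + hdr->lnum * payload, buf + sizeof(*hdr), payload);
        }
}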

> > > If the right way is instead to extend the block layer and device
> > > mapper to encompass the quirks of NAND in a sensible fashion, then UBI
> > > should not go in.
> > 
> > No, block layer on top of FLASH needs 80% of the functionality of UBI in
> > the first place.
> 
> Incorrect. A block-based filesystem on top of flash needs this
> functionality. But a block device suitable to device mapper layering
> (which then provides the functionality) does not.

How exactly does device mapper provide:

A) across-device wear levelling ?
B) dynamic partitioning for FLASH aware file systems ?
C) across-device wear levelling for FLASH aware file systems ?
D) background bit-flip correction (copying affected blocks and recycling
the old ones) ?
E) position independent placement of the second stage bootloader ?

> > You need to implement a clever journalling block device
> > emulator in order to keep the data alive and the FLASH not worn out
> > within no time. You need the wear levelling, otherwise you can throw
> > away your FLASH in no time.
> 
> And that's why it's in my picture.

Yes, it is in your picture, but:

1) it excludes FLASH aware file systems, and UBI does not.
2) your picture still does not explain how it achieves the above A),
B), C), D) and E)

Your extra path for partitioning(4) and JFFS2 is just a weird hack,
which makes your proposal completely absurd.

> > > Let me draw a picture so we have something to argue about:
> > > 
> > >                      iSCSI/nbd(6)
> > >                           |
> > > filesystem {        swap  |  ext3        ext3     jffs2
> > >                       \   |   |            |       /
> > >                /       \  | dm-crypt->snapshot(5) /
> > > device mapper -|        \ \   |                  /
> > >                |         partitioning           /
> > >                |              |          partitioning(4)
> > >                |        wear leveling(3)  /
> > >                |              |          /
> > >                |      block concatenation
> > >                |       |    |    |     |
> > >                \      bad block remapping(2)   
> > >                        |    |    |     |
> > > MTD raw block {     raw block devices with no smarts(1)
> > >                       /     |     \      \
> > > hardware {         NAND    NAND   NAND   NAND
> > > Notes:


Let me draw a UBI picture:

                                   VFS -Layer
                                       |               
                             (Future VFS-crypto)			
                            /                    \
                           |                      |
          ______ device mapper / fs               |
        /                 |                       |
      /                   |                       |
     |          Do whatever you don't             |
     |          want to do with FLASH             |
     |          whether it makes sense            |
     |          or not.                           |
     |                     |                      |
Block layer   UBI block device    Flash aware filesystems
     |                      \    /
   device                    \  /
   driver                      |
     |                         |
_____|_______                  |
|           |                  |
| Device    |                 UBI
| resident  |                  |
| "UBI"     |                  |
|___________|                  |
                               |                     
                            MTD-CORE
                               |      
                       nand base driver
                    /          |         \
             device driver  device driver  device driver
                    |              |              |
                 NAND            NAND          NAND

No notes: it's simple and self-explanatory.

> > > 1. This would provide a block device that allowed writing pages and
> > >    a secondary method for erasing whole blocks as well as a method for
> > >    querying/setting out of band information.
> > 
> > Forget about OOB data. OOB data is reserved for ECC. Please read the
> > recommendations of the NAND FLASH manufacturers. NAND gets less reliable
> > with higher density devices and smaller processes.
> > 
> > > 2. This would hide erase blocks either by using an embedded table or
> > >    out of band info. This could stack on top of block concatenation if
> > >    desired.
> > 
> > Hide erase blocks ? UBI does not hide anything. It maps logical
> > eraseblocks, which are exposed to the clients to arbitrary physical
> > eraseblocks on the FLASH device in order to provide across device wear
> > levelling.
> 
> Sorry, I meant hiding bad blocks here. That's why this layer was
> labeled "bad block remapping".

Oh well, a separate bad block remapper. And how is the logical mapping
of wear levelled erase blocks done ?
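
For comparison, the bookkeeping that across-device wear levelling
actually needs is roughly the following (a toy sketch, not UBI's
implementation or data structures): a logical-to-physical eraseblock map
plus a per-PEB erase counter, so a logical block can be transparently
remapped to the least worn physical block.

#include <limits.h>

#define NR_PEBS 8192    /* e.g. 1GiB device with 128KiB eraseblocks */

static int      leb_to_peb[NR_PEBS];    /* logical -> physical map */
static unsigned erase_count[NR_PEBS];   /* per-PEB erase counter   */

/* Pick the free physical eraseblock with the lowest erase count. */
static int find_least_worn(const int *is_free)
{
        unsigned best_cnt = UINT_MAX;
        int i, best = -1;

        for (i = 0; i < NR_PEBS; i++) {
                if (is_free[i] && erase_count[i] < best_cnt) {
                        best_cnt = erase_count[i];
                        best = i;
                }
        }
        return best;
}

/* Remap a logical eraseblock to a fresh physical one before it gets
 * erased again, so no single PEB accumulates all the erase cycles. */
static void remap_leb(int lnum, int *is_free)
{
        int old = leb_to_peb[lnum];
        int new = find_least_worn(is_free);

        if (new < 0)
                return;                 /* no free PEB available */
        is_free[new] = 0;
        is_free[old] = 1;
        erase_count[old]++;             /* the old PEB gets erased */
        leb_to_peb[lnum] = new;
}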

> > > 3. This would provide wear leveling, and probably simultaneously
> > >    provide relatively efficient and safe access to write sector 
> > >    and page-sized I/O. Below this level, things had better be
> > >    comfortable with the limitations of NAND if they want to work well.
> > 
> > I don't see how this provides across device wear levelling.
> 
> Because the layer immediately beneath it ("block concatenation") takes
> N devices and presents one logical device.

And how is the wear levelling done on this logical device in
device mapper ?

How is it ensured that the wear average is also maintained across the
partitions which are used by JFFS2 or other FLASH aware filesystems ?

> > > 4. JFFS2 has its own wear-leving scheme, as do several other
> > >    filesystems, so they probably want to bypass this piece of the stack.
> > 
> > JFFS2 on top of UBI delegates the wear levelling to UBI, as JFFS2s own
> > wear levelling sucks. 
> 
> Ok, fine. How about LogFS, then?

LogFS can easily leverage UBI's wear levelling algorithm.

> > > 5. We don't reimplement higher pieces of the stack (dm-crypt,
> > >    snapshot, etc.).
> > 
> > Why should we reimplement that ?
> 
> So that you can get encryption and snapshot, etc.?

1. On top of a clever block device.

2. UBI can do snapshots by design.

3. Encryption should be done on the VFS layer and not below the
filesystem layer. Doing it inside the block layer or the device mapper
is broken by design.

> > > 6. We make some things possible that simply aren't otherwise.
> > >
> > > And this picture isn't even interesting yet. Imagine a dm-cache layer
> > > that caches data read from disks in high-speed flash. Or using
> > > dm-mirror to mirror writes to local flash over NBD or to a USB drive.
> > > Neither of these can be done 'right' in a stack split between device
> > > mapper and UBI.
> > 
> > Err. Implement a clever block layer on top of UBI and use all the
> > goodies you want including device mapper.
> 
> If I wanted to have both device mapper and device mapper's little
> brother in my kernel, I wouldn't have started this thread.

You still did not explain how device mapper does:

- across-device wear levelling
- dynamic partitioning for FLASH aware file systems
- across-device wear levelling for FLASH aware file systems
- simple boot loader support
- fine grained I/O

UBI is not device mapper's little brother. It is the software version of
the silicon in a CF card / USB stick, but it does a better job and
allows clever usage of FLASH without enforcing eraseblock sized I/O
units. Does your CF card / USB stick do that ?

Just think about a 1GiB USB stick which would present you with 64KiB I/O
units instead of 2KiB ones.

Your signature is a nice intellectual signboard, but the ancient and
simple rule of three tells me that you are off by a factor of 32.

	tglx

