Message-ID: <alpine.LRH.2.02.1401302116180.9767@file01.intranet.prod.int.rdu2.redhat.com>
Date: Thu, 30 Jan 2014 21:43:07 -0500 (EST)
From: Mikulas Patocka <mpatocka@...hat.com>
To: James Bottomley <James.Bottomley@...senPartnership.com>
cc: Jens Axboe <axboe@...nel.dk>,
"Alasdair G. Kergon" <agk@...hat.com>,
Mike Snitzer <msnitzer@...hat.com>, dm-devel@...hat.com,
"David S. Miller" <davem@...emloft.net>, linux-ide@...r.kernel.org,
linux-scsi@...r.kernel.org, linux-kernel@...r.kernel.org,
Neil Brown <neilb@...e.de>, linux-raid@...r.kernel.org,
linux-mm@...ck.org
Subject: Re: [PATCH] block devices: validate block device capacity
On Thu, 30 Jan 2014, James Bottomley wrote:
> > A device may be accessed directly (by opening /dev/sdX) and it creates a
> > mapping too - thus, the size of a mapping limits the size of a block
> > device.
>
> Right, that's what I suspected below. We can't damage large block
> support on filesystems just because of this corner case.
Devices larger than 16TiB never worked on a 32-bit kernel, so this patch
isn't damaging anything.
Note that even if you attach a block device larger than 16TiB and only
mount it, without ever opening /dev/sdX directly, it still won't work: the
buffer cache uses the page cache (see the function __find_get_block_slow
and its variable "pgoff_t index" - that variable would overflow if the
filesystem accessed a buffer beyond 16TiB).
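To make the arithmetic concrete, here is a minimal userspace sketch (not
kernel code; the index computation just mirrors the style of
__find_get_block_slow): with 4KiB pages, a 32-bit pgoff_t can index at
most 2^32 pages, i.e. 16TiB, and anything beyond that silently wraps to
the wrong page:

/* Userspace illustration of the pgoff_t overflow on a 32-bit kernel. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* 4KiB pages */
#define BLKBITS    9	/* 512-byte blocks */

int main(void)
{
	/* a block at the 17TiB mark, just past the 16TiB limit */
	uint64_t block = (17ULL << 40) >> BLKBITS;

	uint64_t true_index = block >> (PAGE_SHIFT - BLKBITS);
	uint32_t index = (uint32_t)(block >> (PAGE_SHIFT - BLKBITS));

	printf("true page index: %llu\n", (unsigned long long)true_index);
	printf("32-bit pgoff_t:  %u\n", index);	/* wrapped - wrong page */
	return 0;
}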
> > The main problem is that pgoff_t has 4 bytes - changing it to 8 bytes may
> > fix it - but there may be some hidden places where pgoff is converted to
> > unsigned long - who knows, if they exist or not?
>
> I don't think we want to do that ... it will make struct page fatter and
> have knock-on impacts in the radix tree code. To fix this, we need to
> make the corner case (i.e. opening large block devices without a
> filesystem) bear the pain. It sort of looks like we want to do a linear
> array of mappings of 64TB for the device so the page cache calculations
> don't overflow.
The code that reads and writes data to block devices and files is shared -
the functions in mm/filemap.c work for both files and block devices.
So, if you want 64-bit page offsets, you need to increase pgoff_t size,
and that will increase the limit for both files and block devices.
You shouldn't have one set of functions for managing pages of files and
another set for managing pages of block devices - that would increase code
size and create maintenance problems.
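Schematically (an illustrative sketch only - the real declarations live in
include/linux/fs.h, and the "toy_" names below are made up): files and
block devices hang their pages off the same kind of mapping, keyed by
pgoff_t, so widening the type widens the limit for both:

/* Toy address_space keyed by a 32-bit pgoff_t. Not kernel code. */
#include <stdint.h>

typedef uint32_t pgoff_t;	/* unsigned long on a 32-bit kernel */

struct page;			/* opaque here */

struct toy_address_space {
	/* the real kernel uses a radix tree indexed by pgoff_t */
	struct page *(*find_page)(struct toy_address_space *, pgoff_t);
};

/* shared lookup path, whether the mapping backs a file or a bdev;
 * the cast truncates for any byte offset beyond 16TiB */
struct page *toy_lookup(struct toy_address_space *m, uint64_t off)
{
	return m->find_page(m, (pgoff_t)(off >> 12));
}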
> > Though, we need to know if the people who designed memory management agree
> > with changing pgoff_t to 64 bits.
>
> I don't think we can change the size of pgoff_t ... because it won't
> just be that, it will be other problems like the radix tree.
If we can't change it, then we must stay with the current 16TiB limit.
There's no other way.
> However, you also have to bear in mind that truncating large block
> device support to 64TB on 32 bits is a technical ABI break. Hopefully
> it is only technical because I don't know of any current consumer block
> device that is 64TB yet, but anyone who'd created a filesystem >64TB
> would find it no-longer mounted on 32 bits.
> James
It is not an ABI break, because block devices larger than 16TiB never
worked on 32-bit architectures. So it's better to refuse them outright
than to cause subtle lockups or data corruption.
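The check itself is cheap - roughly something like the following sketch
(illustrative only; the actual patch text may differ in details):

/* Sketch of the idea behind the capacity check: refuse any device
 * whose last page index would not fit in pgoff_t. Not the patch itself. */
#include <stdint.h>
#include <errno.h>

#define PAGE_SHIFT 12
typedef uint32_t pgoff_t;	/* 32-bit kernel */

int validate_capacity(uint64_t bytes)
{
	if (bytes == 0)
		return 0;
	if ((bytes - 1) >> PAGE_SHIFT > (uint64_t)(pgoff_t)~0UL)
		return -EFBIG;	/* beyond 16TiB: refuse at open time */
	return 0;
}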
Mikulas