Message-Id: <20061219.171026.115904158.k-ueda@ct.jp.nec.com>
Date: Tue, 19 Dec 2006 17:10:26 -0500 (EST)
From: Kiyoshi Ueda <k-ueda@...jp.nec.com>
To: jens.axboe@...cle.com, agk@...hat.com, mchristi@...hat.com,
linux-kernel@...r.kernel.org, dm-devel@...hat.com
Cc: j-nomura@...jp.nec.com, k-ueda@...jp.nec.com
Subject: [RFC PATCH 0/8] rqbased-dm: request-based device-mapper
Hello,
I'm working on device-mapper multipath (dm-multipath).
This patch set adds a new hook for device-mapper below the I/O
scheduler and enables mapping at the request level instead of the
bio level.
The patch set could be a basis for better dynamic load balancing.
It has been preliminarily tested on active-active storage with 2 paths,
but it still needs work and is not ready for inclusion.
I'm posting it because I'd like to get comments on the high-level
design before going further into the details.
The items on which I'd especially like comments are listed below.
For block layer maintainers and developers:
This patch set makes the 2 block layer changes below.
- Changed blk_get_request() to allow calls from interrupt context
  so that a queue's request_fn can use it. (PATCH 1/8)
  (*) The behaviour of CFQ (or any other scheduler which depends on
  "current") may be affected when blk_get_request() is called
  in interrupt context, because "current" is then not the process
  which issued the original request.
- Added a new "end_io_first" hook to __end_that_request_first()
  and struct request. (PATCH 2/8)
I'm also thinking about:
- Moving blk_clone_rq() from dm.c to ll_rw_blk.c. (PATCH 7/8)
Are these changes acceptable?
For dm maintainers and developers:
- About splitting the 'map' method into 'prep_map' and 'map'.
- About I/O spanning across targets.
Please see the "Possible discussion items" section below for details.
This patch set should be applied on top of 2.6.19.1.
====================================================================
Background
=-=-=-=-=-=
The current device-mapper is bio-based, and dm-multipath has the
issues below.
- Because the hook for I/O mapping is above the block layer's
  __make_request(), contiguous bios can be mapped to different
  underlying devices, and these bios are then not merged into a
  request.
  Dynamic load balancing could cause this situation, though
  it has not been implemented yet.
  Therefore, I/O mapping after bios are merged is needed for better
  dynamic load balancing.
- There is no mechanism for escalating error codes (sense keys)
  from the SCSI layer to device-mapper, though some storage needs
  storage-dependent error code handling.
  A patch piggybacking the error code on struct bio was posted,
  but it was rejected, and the comment at the time was "struct request
  would be better if this feature is implemented."
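The first issue above (lost merging) can be illustrated with a small
userspace model. This is only a sketch, not kernel code: struct
model_bio and both functions are hypothetical names, and round-robin
path selection per bio is an assumed load balancing policy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of a bio; not the kernel's struct bio. */
struct model_bio {
    unsigned long sector;  /* first sector */
    unsigned int nsect;    /* length in sectors */
};

/* Nonzero if b starts exactly where a ends, i.e. the two could merge. */
static int contiguous(const struct model_bio *a, const struct model_bio *b)
{
    return a->sector + a->nsect == b->sector;
}

/*
 * Bio-based model: every bio is assigned to a path (round-robin here,
 * as an assumption) before merging, so merging only happens among the
 * bios that landed on the same path.  Returns the total number of
 * requests built across all paths.
 */
static size_t requests_bio_based(const struct model_bio *bios, size_t n,
                                 size_t npaths)
{
    size_t total = 0;

    assert(npaths > 0);
    for (size_t p = 0; p < npaths; p++) {
        const struct model_bio *prev = NULL;

        for (size_t i = p; i < n; i += npaths) {
            if (!prev || !contiguous(prev, &bios[i]))
                total++;
            prev = &bios[i];
        }
    }
    return total;
}

/*
 * Request-based model: bios are merged first; each merged request is
 * then mapped to a single path.  Returns the number of requests.
 */
static size_t requests_request_based(const struct model_bio *bios, size_t n)
{
    size_t total = 0;

    for (size_t i = 0; i < n; i++) {
        if (i == 0 || !contiguous(&bios[i - 1], &bios[i]))
            total++;
    }
    return total;
}
```

In this model, eight contiguous 8-sector bios spread over two paths
become eight separate requests when mapped per bio, but a single
request when mapped after merging.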
To resolve these issues, Mike Christie previously posted a block layer
(request-based) multipath patch.
(http://marc.theaimsgroup.com/?l=linux-scsi&m=115520444515914&w=2)
Mike's patch added a new block layer device for multipath and
didn't have a device-mapper interface, so I modified his patch so that
it can be used from dm-multipath.
=====================================================================
Design Overview
=-=-=-=-=-=-=-=-=
Overview of the request-based device-mapper patch:
- Mapping is done in units of struct request, instead of struct bio.
- The hook for I/O mapping is at request_fn, after merging and sorting
  by the I/O scheduler, instead of at make_request_fn.
- The hook for I/O completion is at end_that_request_*, instead of
  bio_endio.
- Whether a dm device is bio-based or request-based is specified
  at device creation by an ioctl parameter.
- The user interface (table/message/status) is kept the same.
- Any dm/md device can be stacked on a request-based dm device.
  A request-based dm device *cannot* be stacked on a bio-based dm device.
- The block device queue is used instead of the multipath target's
  internal queue.
- Currently no work is done for hw_handler.
  Mike Christie is moving the hardware handlers to the SCSI layer.
- Difference in the target driver methods:
    current (bio-based)    this patch (request-based)
    ----------------------------------------------------------
    map                    prep_map (decides target device)
                           map (translates cloned request)
    end_io                 end_io_first (error check)
                           end_io (free memory/retry)
Expected benefits:
- better load balancing
- affinity with the I/O scheduler
- user space tools need only minimal changes
- could be a basis for an error code escalation feature
=====================================================================
Possible discussion items
=-=-=-=-=-=-=-=-=-=-=-=-=-=
About splitting 'map' method into 'prep_map' and 'map'
-------------------------------------------------------
In the current bio-based dm, a clone of the bio is made in dm core and
passed to the target's map function.
In request-based dm, on the other hand, the clone of the request must
be obtained from the queue of the underlying device it is mapped to.
So I added a prep_map function with which targets decide in advance
the device to which the I/O should be directed, so that dm core
can get the clone of the request before calling the map function.
Though I think this prep_map approach is the best way,
I'd like to get comments if you have any other ideas.
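The two-step flow described above can be sketched as a userspace
model. None of this is the patch's actual code: the struct layouts,
model_dispatch(), rr_prep_map() and model_map() are hypothetical, the
round-robin choice is an assumed policy, and "translation" is modelled
as a simple sector offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for an underlying queue and a request. */
struct model_queue {
    int id;
    unsigned int allocated;   /* clones handed out by this queue */
};

struct model_request {
    unsigned long sector;
    unsigned int nsect;
    struct model_queue *q;    /* queue the request belongs to */
};

struct model_target {
    unsigned long begin;      /* start of the target in the dm device */
    struct model_queue *paths;
    size_t npaths;
    size_t next;              /* round-robin cursor (assumed policy) */
    /* Step 1: decide which underlying queue the I/O goes to. */
    struct model_queue *(*prep_map)(struct model_target *t,
                                    const struct model_request *rq);
    /* Step 2: translate the clone for the chosen device. */
    int (*map)(struct model_target *t, struct model_request *clone);
};

static struct model_queue *rr_prep_map(struct model_target *t,
                                       const struct model_request *rq)
{
    struct model_queue *q;

    (void)rq;                 /* this policy ignores the request itself */
    assert(t->npaths > 0);
    q = &t->paths[t->next];
    t->next = (t->next + 1) % t->npaths;
    return q;
}

static int model_map(struct model_target *t, struct model_request *clone)
{
    clone->sector -= t->begin;   /* translate into the target's range */
    return 0;
}

/*
 * Model of the core: ask the target for a device first, take the
 * clone from that device's queue, and only then call map.
 */
static int model_dispatch(struct model_target *t,
                          const struct model_request *orig,
                          struct model_request *clone)
{
    struct model_queue *q = t->prep_map(t, orig);

    if (!q)
        return -1;
    *clone = *orig;
    clone->q = q;             /* clone now belongs to the chosen queue */
    q->allocated++;
    return t->map(t, clone);
}
```

The point of the split is the ordering inside model_dispatch(): the
device has to be known before the clone can exist, so the decision
cannot be made inside map itself.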
About I/O spanning across targets
----------------------------------
Currently, splitting of I/O that spans targets has not been
implemented yet, but it will be needed.
There are 2 ways to implement it:
1) Do it in request_fn (request level splitting)
2) Do it in make_request_fn (bio level splitting)
Request level splitting is difficult because:
- The bios in the request need to be split too.
- There is an assumption in the block layer that a request finishes
  in order from head to tail. If the request is split and
  the latter half finishes first, this assumption is broken.
  Changing it would require major modification of the block layer.
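The head-to-tail assumption can be illustrated with a tiny userspace
model. This is a sketch under assumptions, not block layer code:
struct model_request and model_complete() are hypothetical, and the
request is reduced to a byte count plus a "completed so far" offset.

```c
#include <assert.h>

/* Hypothetical model: a request only tracks how far it has completed. */
struct model_request {
    unsigned int total;   /* bytes in the request */
    unsigned int done;    /* bytes completed so far, counted from the head */
};

/*
 * Complete nbytes starting at offset.  Because progress is a single
 * offset from the head, a completion that does not start at the
 * current head cannot be recorded.
 * Returns 1 when the request is finished, 0 when more remains,
 * and -1 when the completion is out of order or too long.
 */
static int model_complete(struct model_request *rq, unsigned int offset,
                          unsigned int nbytes)
{
    assert(rq->done <= rq->total);
    if (offset != rq->done)
        return -1;                    /* tail finished before head */
    if (nbytes > rq->total - rq->done)
        return -1;
    rq->done += nbytes;
    return rq->done == rq->total;
}
```

If an 8192-byte request is split in two and the second half reports
completion first, this model has no way to represent it until the
first half has completed.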
Bio level splitting can be done in the following way:
- Create dm_make_request() and set it as make_request_fn.
- Like dm_request() currently does, dm_make_request() will
  split and clone the bio. The NO_MERGE flag is set on the cloned bio
  so that it can't be merged again in the generic __make_request().
- The cloned bio is handed over to the I/O scheduler of the mapped
  device by calling __make_request() for it, in the same way as for
  a bio that doesn't span targets.
- When the cloned bio is returned, the end_io hook function is called
  and waits for all the split clones to finish.
  (This is like the current clone_endio() implementation.)
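The splitting and the wait for all clones can be sketched in a
userspace model. The names here (model_split(), model_clone_endio()
and the structs) are hypothetical and only mimic the idea; the sketch
also assumes the targets are sorted and cover the device without gaps.

```c
#include <assert.h>
#include <stddef.h>

struct model_bio {
    unsigned long sector;
    unsigned int nsect;
};

struct model_target {
    unsigned long begin;   /* first sector of the target */
    unsigned long len;     /* length in sectors */
};

/*
 * Split bio at target boundaries into out[].  Returns the number of
 * clones, or 0 if out[] is too small or the bio is not fully covered.
 */
static size_t model_split(const struct model_bio *bio,
                          const struct model_target *tgts, size_t ntgts,
                          struct model_bio *out, size_t max_out)
{
    unsigned long sector = bio->sector;
    unsigned long remaining = bio->nsect;
    size_t n = 0;

    assert(out != NULL);
    for (size_t i = 0; i < ntgts && remaining > 0; i++) {
        unsigned long end = tgts[i].begin + tgts[i].len;
        unsigned long chunk;

        if (sector < tgts[i].begin || sector >= end)
            continue;                 /* bio does not touch this target */
        chunk = end - sector;
        if (chunk > remaining)
            chunk = remaining;
        if (n == max_out)
            return 0;
        out[n].sector = sector;
        out[n].nsect = (unsigned int)chunk;
        n++;
        sector += chunk;
        remaining -= chunk;
    }
    return remaining == 0 ? n : 0;
}

/* Per-original-bio state shared by all of its clones. */
struct model_io {
    unsigned int pending;  /* clones still in flight */
    int error;             /* first error seen, 0 if none */
    int done;              /* set once every clone has finished */
};

/* Called once per finished clone; the original ends with the last one. */
static void model_clone_endio(struct model_io *io, int error)
{
    assert(io->pending > 0);
    if (error && !io->error)
        io->error = error;
    if (--io->pending == 0)
        io->done = 1;
}
```

With two 100-sector targets, a 20-sector bio starting at sector 90 is
split into two 10-sector clones, and the original is only marked done
after model_clone_endio() has been called for both.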
So I think bio level splitting is better.
What do you think?
=====================================================================
TODO
=-=-=
o Support I/O spanning across targets (dm core)
o Support noflush suspend (dm core)
o Support HW handler for path changing (multipath)
Thanks in advance,
Kiyoshi Ueda