On Mon, 13 Dec 2010 19:02:17 -0600 C Anthony Risinger <anthony@extof.me> wrote:
On Sun, Dec 12, 2010 at 9:29 AM, Dieter Plaetinck <dieter@plaetinck.be> wrote:
Anthony and other interested folks, I've been looking a bit further, and it seems like btrfs support shouldn't be too hard to implement. It actually seems simpler than LVM (because LVM has 3 levels: PV, VG and LV; btrfs has just the btrfs itself (~default subvolume) and other subvolumes). subvolumes don't get a new devicefile, but i'll probably use something like /dev/sda:$spec to denote what's what. if $spec is a number, it will become mount option subvolid=$spec; otherwise subvol=$spec. although from what i can tell, the ids aren't used often, and it seems more robust to me to use names anyway.
yeah, that would work fine; it should be simpler than LVM. the problem with mounting by name is that it only works when "name" is in the btrfs root (the real root, subvolid=0), i.e.:
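the /dev/sda:$spec convention could be parsed roughly like this. a minimal Python sketch; parse_spec is a hypothetical helper (not part of aif), and it only produces the mount option string, it doesn't mount anything:

```python
from typing import Optional, Tuple


def parse_spec(spec: str) -> Tuple[str, Optional[str]]:
    """Split a "/dev/sda:$spec" string into (device, subvolume mount option).

    - no ":$spec" suffix  -> the btrfs itself, no subvolume option
    - numeric $spec       -> subvolid=$spec
    - anything else       -> subvol=$spec
    """
    # partition() splits on the first ":", so names may themselves contain ":"
    device, sep, sub = spec.partition(":")
    if not sep or not sub:
        return device, None
    if sub.isdigit():
        return device, f"subvolid={sub}"
    return device, f"subvol={sub}"


if __name__ == "__main__":
    for s in ("/dev/sda", "/dev/sda:home", "/dev/sda:257"):
        print(s, "->", parse_spec(s))
```

keeping the option as a separate string means the same parser works whether the result ends up in an fstab line or on a mount command line.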
/subvol
... works, but not ...
/nested/subvol
is this the only limitation with named subvols? i'm probably missing something, but since the real btrfs is basically used as a container holding the subvolumes you will mount in your filesystem (in arbitrary places), why does it matter where in the btrfs tree the subvolumes are defined? what's wrong with putting subvolumes in the btrfs root? it's too bad separate subvolumes don't get their own devicefiles (and hence associated /dev/disk/by-uuid/ and /dev/disk/by-label/ symlinks); that would make my life easier. are the ids "stable"? i.e., suppose i initialise a variable last_id=0 and, for every subvolume the user wants to create, do id=last_id+1: can i assume that id will stay the same for this volume? (at least for the duration of the installation, i.e. not worrying about snapshot rollbacks etc.) (and apparently, skip id 5 because that's already taken)
the hook i'm about to release doesn't support names; it's just too inflexible. btw, for clarity to anyone else: the default subvol is not the same as the btrfs root (though initially they are the same). the default subvol is whichever subvol is marked as the _mount_ default (and later mountable via `subvol=.` or with no subvol option at all)... the real root will always be subvolid=0 or 5.
subvolid 5?
* what are the requirements your btrfs_advanced mkinitcpio hook implies? what does aif need to do other than just running mkfs.btrfs to get the full potential out of btrfs/your hook? please explain why a default btrfs configuration does not suffice. does it have something to do with https://btrfs.wiki.kernel.org/index.php/UseCases#Can_a_snapshot_be_replaced_... ?
it's sort of related to that, i think. the reeeeeaaaaalllly messy part is what to do when a user has installed the system into the btrfs root instead of a dedicated subvol. the issue is that the btrfs root is not movable/editable/replaceable; all other subvols can be moved/renamed/deleted/etc., except the root. thus, there is no clean way to programmatically "move" the system (in preparation for rollback, snapshot management, etc.). everything in / must be rm -rf'ed manually or it will ultimately become dead space. i've brought this up probably 5 different times on the list but never get any response :-(
the hook (and other implementations, i'd assume) uses the btrfs root for volume management, as a "sub-root". the actual "system root" is just one of many subvols in the pool, and may change between reboots. at the very least, if AIF created a subvol, marked it as default, and installed into that subvol, my hook could then safely "rotate" the user into a more advanced configuration...
should i give it a specific name? or just a subvol marked as default? what kind of advanced config do you mean? is there anything that makes more sense to do during the installation step? or does it become too specific to your hook?
i just need the system in a subvol. the only difference the user sees with this procedure (a dedicated subvol by default) is a "mysterious" directory in the output of "btrfs subvolume list" that doesn't seem to exist :-) because it's actually underneath their /.
but really, under no circumstances do i think the system should be installed into the btrfs root; i wouldn't even offer it at install time. if users want that, they can do it themselves... they will be happy it's in a subvol.
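for the "create a subvol, mark it as default, install into it" step, aif would need the new subvol's id before it can mark it default. a rough Python sketch, assuming `btrfs subvolume list` prints lines like "ID 256 gen 7 top level 5 path rootfs" (older btrfs-progs omit the gen field, so the regex only keys on "ID" and "path"; the exact layout is an assumption). "rootfs" is a hypothetical name, and the sketch only builds the set-default command, it doesn't run it:

```python
import re
from typing import Dict, List

# Keys only on "ID <n>" and "path <name>"; the fields in between differ
# between btrfs-progs versions (e.g. "gen 7 top level 5").
_LINE = re.compile(r"^ID (\d+)\b.*?\bpath (.+)$")


def subvol_ids(listing: str) -> Dict[str, int]:
    """Map subvolume path -> id from `btrfs subvolume list` output."""
    ids: Dict[str, int] = {}
    for line in listing.splitlines():
        m = _LINE.match(line.strip())
        if m:
            ids[m.group(2)] = int(m.group(1))
    return ids


def set_default_cmd(listing: str, name: str, mountpoint: str) -> List[str]:
    """Build the argv that marks subvolume `name` as the mount default."""
    ids = subvol_ids(listing)
    if name not in ids:
        raise KeyError(f"no subvolume named {name!r}")
    return ["btrfs", "subvolume", "set-default", str(ids[name]), mountpoint]


# Hypothetical sample output, not captured from a real system.
SAMPLE = """\
ID 256 gen 7 top level 5 path rootfs
ID 257 gen 9 top level 5 path home
"""

if __name__ == "__main__":
    print(set_default_cmd(SAMPLE, "rootfs", "/mnt"))
```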
okay, fair enough. i will make it so that you can't choose a mountpoint for the actual btrfs, only for subvolumes. if the btrfs guys ever make things more flexible, it's fairly trivial for us to adapt aif/your hook.
* I've read a bit more about btrfs and I think an implementation like this will suffice for most users:
- allow creation of a btrfs on top of 1-n blockdevices (user can pick raid levels for data and metadata)
- allow creation of 0-m subvolumes
- each subvolume, as well as the default, can get an arbitrary mountpoint, as well as specific mount options like compress, ssd, etc.
if i understood correctly, that is.
yup, i think that's everything for now! ssd should be enabled automatically when btrfs detects non-rotating media, and ssd_spread is for cheaper flash i believe... i forget what the reason was. for compress, we should be sure to note the CPU overhead of zlib (though LZO patches will be in the next kernel i believe, exciting), though for many systems it may not matter.
okay, but as per your advice in the previous paragraph, users won't be able to select a mountpoint for the btrfs itself, only for the subvolumes. (or maybe I'll just "discourage" them with a warning message)
However, to be fully compatible with your hook, I will probably "strongly recommend" creating a subvolume __active and mounting that as /. Right? Anything I missed?
in the newer setup __active isn't used anymore; i don't intend to develop that configuration further, and will phase everyone out of it in favor of this upcoming release. the new structure looks like this:
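the plan above maps onto a small data model fairly directly. a sketch in Python, with hypothetical class names; it follows the thread's decision that only subvolumes get mountpoints (there is deliberately no mountpoint field for the btrfs itself) and only builds the mkfs.btrfs argv and fstab lines without running anything. the raid level strings are just passed through to mkfs.btrfs's -d/-m flags:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Subvolume:
    name: str        # subvolume name inside the btrfs root
    mountpoint: str  # where it gets mounted in the installed system
    options: List[str] = field(default_factory=list)  # e.g. ["compress"]


@dataclass
class BtrfsPlan:
    devices: List[str]  # 1-n block devices
    data_raid: str      # e.g. "single", "raid0", "raid1", "raid10"
    metadata_raid: str  # e.g. "single", "dup", "raid1", "raid10"
    subvolumes: List[Subvolume] = field(default_factory=list)  # 0-m subvolumes

    def __post_init__(self) -> None:
        if not self.devices:
            raise ValueError("need at least one block device")
        seen = set()
        for sv in self.subvolumes:
            if sv.mountpoint in seen:
                raise ValueError(f"duplicate mountpoint {sv.mountpoint}")
            seen.add(sv.mountpoint)

    def mkfs_cmd(self) -> List[str]:
        return ["mkfs.btrfs", "-d", self.data_raid, "-m", self.metadata_raid,
                *self.devices]

    def fstab_lines(self) -> List[str]:
        # Any member device can be named once `btrfs device scan` has run;
        # the first one is used here for simplicity.
        dev = self.devices[0]
        lines = []
        for sv in self.subvolumes:
            opts = ",".join([f"subvol={sv.name}", *sv.options])
            lines.append(f"{dev} {sv.mountpoint} btrfs {opts} 0 0")
        return lines
```

in a real installer a UUID= or LABEL= source in fstab would be more robust than a device name; the device is used here only to keep the sketch short.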
---------------------------------------------------------------------------------
/var/lib/btrfsadm
|-- boot
|   |-- extlinux.conf
|   `-- vesamenu.c32
|-- HEAD -> refs/rw/PRI
|-- pool
|   |-- FREE -> /dev/disk/by-label/btrfs-pool-free
|   `-- SELF -> /dev/disk/by-label/btrfs-pool-self
|-- refs
|   |-- ro
|   |   |-- log
|   |   |   |-- 1291021356 -> ../../../vols/260
|   |   |   |-- 1291056164 -> ../../../vols/261
|   |   |   `-- 1291102035 -> ../../../vols/262
|   |   `-- usr
|   |       `-- ORIG -> ../../../vols/260
|   `-- rw
|       |-- PRI -> ../../vols/262
|       |-- SEC -> ../../vols/261
|       `-- usr
`-- vols
    |-- 260
    |   |-- boot
    |   |   |-- kernel26-fallback.img
    |   |   |-- kernel26.img
    |   |   |-- System.map26
    |   |   `-- vmlinuz26
    |   `-- fs (THIS IS A SUBVOL)
    |-- 261
    |   |-- boot
    |   |   |-- kernel26-fallback.img
    |   |   |-- kernel26.img
    |   |   |-- System.map26
    |   |   `-- vmlinuz26
    |   `-- fs (THIS IS A SUBVOL)
    `-- 262
        |-- boot
        |   |-- kernel26-fallback.img
        |   |-- kernel26.img
        |   |-- kxloader.img
        |   |-- System.map26
        |   `-- vmlinuz26
        `-- fs (THIS IS THE ACTIVE SYSTEM ROOT)
---------------------------------------------------------------------------------
so... while much more involved, it's still very simple and 1000x more flexible. it's heavily inspired by the .git directory setup.
a quick breakdown:
/boot this is the real boot device; can be a separate partition/disk, multiple disks, or on the same btrfs FS (currently extlinux only). also used for a 2-stage boot -- a kernel based "bootramfs" bootloader is used to mount, find, and kexec the real kernel within a snapshot, since standard bootloaders can't see inside subvols yet.
/HEAD a symlink to a symlink. HEAD points to the active ref (or directly to a subvol, the git equivalent of a "detached head"), which points to a particular subvol. at any given time, when the system is running, HEAD will _always_ point to the current subvol in use.
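the double indirection is plain symlink resolution, the same way git resolves HEAD. a small Python sketch that rebuilds a slice of the layout above in a temp directory (ordinary directories stand in for subvols) and then reads HEAD one hop at a time or fully dereferenced; the function names are hypothetical, not btrfsadm's:

```python
import os
import tempfile


def build_demo_tree(root: str) -> None:
    """Recreate a tiny slice of the layout above (plain dirs stand in for subvols)."""
    for vol in ("261", "262"):
        os.makedirs(os.path.join(root, "vols", vol, "boot"))
    os.makedirs(os.path.join(root, "refs", "rw"))
    os.symlink("../../vols/262", os.path.join(root, "refs", "rw", "PRI"))
    os.symlink("../../vols/261", os.path.join(root, "refs", "rw", "SEC"))
    os.symlink("refs/rw/PRI", os.path.join(root, "HEAD"))


def head_ref(root: str) -> str:
    """First hop only: what HEAD points at (a ref, or vols/<id> if detached)."""
    return os.readlink(os.path.join(root, "HEAD"))


def is_detached(root: str) -> bool:
    return head_ref(root).startswith("vols/")


def active_vol(root: str) -> str:
    """Follow every hop (HEAD -> ref -> vols/<id>) and return the subvol id."""
    return os.path.basename(os.path.realpath(os.path.join(root, "HEAD")))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        build_demo_tree(root)
        print(head_ref(root), is_detached(root), active_vol(root))
```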
/pool symlinks to ourself (SELF -- the active btrfs pool), and any others (FREE will be used in the future if available to "steal" devices; this will enable hot spares and automatic array repair)
/refs a hierarchy of symlinks into the /vols directory. for every subvol the user has, a symlink in here will exist. there will also be some system managed ones (such as "log"... which is autosnap on reboot, if enabled). ORIG=snapshot after install, PRI=primary system root, SEC=the previous system root. user can manage these with the upcoming btrfsadm tool.
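one way PRI and SEC could move when a new system root is promoted. this is only a guess at the semantics described above (PRI = primary system root, SEC = previous one), not what btrfsadm actually does; set_ref and rotate are hypothetical names, and plain directories stand in for subvols in the demo. each link is swapped by renaming a freshly created symlink over the old one, so a reader never sees a missing ref:

```python
import os
import tempfile


def set_ref(root: str, ref: str, vol_id: str) -> None:
    """Point refs/rw/<ref> at vols/<vol_id>, replacing the link atomically."""
    link = os.path.join(root, "refs", "rw", ref)
    tmp = link + ".new"
    if os.path.lexists(tmp):  # leftover from an interrupted run
        os.remove(tmp)
    os.symlink(f"../../vols/{vol_id}", tmp)
    os.replace(tmp, link)  # rename(2) swaps the link itself, not its target


def rotate(root: str, new_vol: str) -> None:
    """Make new_vol the primary root; the old primary becomes SEC."""
    old = os.path.basename(os.readlink(os.path.join(root, "refs", "rw", "PRI")))
    set_ref(root, "SEC", old)
    set_ref(root, "PRI", new_vol)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for vol in ("261", "262", "263"):
            os.makedirs(os.path.join(root, "vols", vol))
        os.makedirs(os.path.join(root, "refs", "rw"))
        set_ref(root, "PRI", "262")
        set_ref(root, "SEC", "261")
        rotate(root, "263")
        print(os.readlink(os.path.join(root, "refs", "rw", "PRI")))
```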
/vols all the actual subvols. named by id. the above `tree` shows a "detached boot" state... where boot is outside the fs. this setup enables extlinux (and others potentially) to perform kernel level rollbacks without the use of a 2-stage boot process, but requires /boot (from within the system) to be a symlink:
# mount
...
/dev/sda on /var/lib/btrfsadm type btrfs (rw,noatime,subvolid=0)
...
# ls -l /boot
lrwxrwxrwx 1 root root 26 Nov 29 03:11 /boot -> var/lib/btrfsadm/HEAD/boot
this way, mkinitcpio and friends work, and copy the kernel to the proper detached boot by dereferencing HEAD. also, since extlinux can follow symlinks, simply pointing to HEAD or other refs in extlinux.conf works (must be under 255 chars). ultimately this is a workaround for bootloaders unable to handle btrfs or btrfs subvols, but it works very well, and is easy to move to an "inclusive boot" later on when bootloader support is better.
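a sketch of what the extlinux.conf side could look like: stanzas that point at HEAD or another ref and enforce the 255-char limit. assumptions, none confirmed by the thread: paths are resolved relative to the btrfs top level (the subvolid=0 mount at /var/lib/btrfsadm), kernel file names follow the `tree` listing above, and the kernel command line (APPEND with root=/rootflags=) is omitted because it depends on the hook. stanza() is a hypothetical helper:

```python
from typing import List, Optional

MAX_PATH = 255  # extlinux path limit quoted in the thread


def stanza(ref_path: str, label: Optional[str] = None) -> List[str]:
    """extlinux.conf lines booting the kernel behind HEAD or a ref.

    ref_path is relative to the btrfs top level, e.g. "HEAD" or "refs/rw/SEC".
    The APPEND line (root=, rootflags=...) is deliberately left out.
    """
    base = f"/{ref_path}/boot"
    kernel = f"{base}/vmlinuz26"
    initrd = f"{base}/kernel26.img"
    for path in (kernel, initrd):
        if len(path) >= MAX_PATH:
            raise ValueError(f"path too long for extlinux: {len(path)} chars")
    return [
        f"LABEL {label or ref_path.replace('/', '_')}",
        f"    LINUX {kernel}",
        f"    INITRD {initrd}",
    ]


if __name__ == "__main__":
    print("\n".join(stanza("HEAD") + stanza("refs/rw/SEC", "previous")))
```

because extlinux follows the symlinks itself, rotating HEAD or a ref changes what these stanzas boot without rewriting the config file.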
o_O O_o so, to paraphrase: in your hook, you build this kind of tree structure based on the btrfs devices you find (/pool) and subvolumes (/vols), and create some symlinks to organize everything (/refs, /HEAD), the idea being that this makes things simpler during hook processing, right? is this structure only in memory while your hook executes, or does it all get written to disk (the btrfs root?) so that the real booted system sees it too?
---------------------------------------------------------------------------------
i know that's a lot of information, and probably more than needed, but i've been meaning to write it down anyway :-)
let me know how you think that could jive with AIF.
Well, the /var/lib/btrfsadm tree you described seems fairly non-standard, but you seem to know what you're doing. If I understood correctly I don't need to worry about the /var/lib/btrfsadm tree, right? So you can do your thing and I'll do mine, making sure to strongly recommend users to put all btrfs mountpoints in separate subvolumes.
C Anthony
Dieter