Re: [arch-releng] btrfs support in aif

14 Dec 2010

On Mon, 13 Dec 2010 19:02:17 -0600
C Anthony Risinger <anthony@extof.me> wrote:
...
On Sun, Dec 12, 2010 at 9:29 AM, Dieter Plaetinck
<dieter@plaetinck.be> wrote:
...
Anthony and other interested folks,
I've been looking a bit further, and it seems like btrfs support
shouldn't be too hard to implement.  It actually seems simpler than
LVM (because lvm has 3 levels: PV,VG and LV; btrfs has just the
btrfs itself (~default subvolume) and other subvolumes)
subvolumes don't get a new devicefile but i'll probably use
something like:
/dev/sda:$spec
to denote what's what.
if $spec is a number, it will become mount option subvolid=$spec;
otherwise subvol=$spec
although from what i can tell, the id's aren't used often. and it
seems more robust to me to use names anyway.
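
A minimal sketch of the "/dev/sda:$spec" rule described above, assuming the
spec is everything after the first colon. The function names
spec_to_mount_opt and split_target are made up for illustration and are not
part of aif:

```sh
#!/bin/sh
# Hypothetical helpers, not taken from aif.

# Turn a spec into a btrfs mount option: digits only -> subvolid=, else subvol=
spec_to_mount_opt() {
    case $1 in
        '')       return 1 ;;                       # empty spec: nothing to print
        *[!0-9]*) printf 'subvol=%s\n' "$1" ;;      # has a non-digit: treat as a name
        *)        printf 'subvolid=%s\n' "$1" ;;    # digits only: treat as an id
    esac
}

# Split "device:spec" and print "<device> <mount option>"
split_target() {
    case $1 in
        *:*) printf '%s %s\n' "${1%%:*}" "$(spec_to_mount_opt "${1#*:}")" ;;
        *)   printf '%s\n' "$1" ;;                  # plain device, no subvolume
    esac
}

split_target /dev/sda:home
split_target /dev/sda:256
```

One consequence of the rule: a subvolume whose name consists only of digits
would be read as an id, so such names could not be addressed by name with
this notation.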
yeah, that would work fine; should be simpler than LVM.  the problem
with mounting by name is that it only works when "name" is in the
btrfs root (the real root, subvolid=0). ie:
/subvol
... works, but not ...
/nested/subvol
is this the only limitation with named subvols?
i'm probably missing something, but since the real btrfs is basically
used as a container which will contain subvolumes which are the
ones you will mount in your filesystem (in arbitrary places), why does
it matter where in the btrfs tree the subvolumes are defined? what's
wrong with putting subvolumes in the btrfs root?

it's too bad separate subvolumes don't get their own devicefiles (and
hence, associated /dev/by-uuid/ and /dev/by-label/ symlinks) that
would make my life easier.
are the id's "stable"? i.e, suppose i initialise variable
last_id=0
for every subvolume the user wants to create, can i just do
id=last_id+1 and assume that id will stay the same for this volume?
(at least for the duration of the installation, i.e. not worrying about
snapshot rollbacks etc) (and apparently, skip id 5 because that's
already taken)
...
the hook i'm soon to release doesn't support names; it's just too
inflexible.  btw, for clarity to anyone else, the default subvol is
not the same as the btrfs root (though initially they are the same).
default subvol is any subvol marked as the _mount_ default (and later
mountable via `subvol=.` or none at all)... the real root will always
be subvolid = 0 or 5.
subvolid 5 ?
...
...
* which are the requirements your btrfs_advanced mkinitcpio hook
 implies?  what things does aif need to do other than just doing
 mkfs.btrfs to get the full potential out of btrfs/your hook?
 please explain why a default btrfs configuration does not suffice.
 does it have something to do with
https://btrfs.wiki.kernel.org/index.php/UseCases#Can_a_snapshot_be_replaced_... ?
it's sort of related to that i think.  the reeeeeaaaaalllly messy part
is what to do when a user has installed the system into the btrfs
root, instead of a dedicated subvol.  the issue is the btrfs root is
not movable/editable/replaceable; all other subvols can be
moved/renamed/deleted/etc... except the root.  thus, there is no clean
way to programmatically "move" the system (in preparation for
rollback/manage snapshots/etc.).  everything in / must be rm -rf'ed
manually or it will ultimately become dead space.  i've brought this
up probably 5 different times to the list but never get any response
:-(
the hook (and other impls i'd assume) use the btrfs root for volume
management, the "sub-root".  the actual "system root" is just one of
many subvols in the pool, and may change between reboots.  at the very
least, if AIF created a subvol, marked as default, and installed into
that subvol, my hook could then safely "rotate" the user into a more
advanced configuration...
should i give it a specific name? or just a subvol marked as default?
what kind of advanced config do you mean? any stuff that makes more
sense to be done during the installation step? or does it become too
specific to your hook?
...
i just need the system in a subvol.  the
only difference user sees by this procedure (dedicated subvol by
default) is a "mysterious" directory when they run "btrfs subvolume
list" that doesn't seem to exist :-) because it's actually underneath
their /.
but really, under no cases do i think the system should be installed
into the btrfs root, i wouldn't even offer it at install time.  if the user
wants that they can do it themselves... they will be happy it's in a
subvol.
okay, fair enough. i will make it so that you can't choose a mountpoint
for the actual btrfs, only for subvolumes.  if the btrfs guys ever make
things more flexible, it's fairly trivial for us to adapt aif/your hook.
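
The arrangement agreed on above (create a subvolume, mark it as default,
install into it) could look roughly like the dry run below. It only prints
the commands. The helper names, the device, the mountpoint and the subvolume
name are placeholders, and the "ID <id> ... path <name>" line format assumed
for `btrfs subvolume list` output should be checked against the installed
btrfs-progs:

```sh
#!/bin/sh
# Hypothetical helpers, not taken from aif or the btrfs_advanced hook.

# Read `btrfs subvolume list` output on stdin and print the id of subvolume $1.
# Assumes lines of the form "ID <id> ... path <name>".
subvol_id_from_list() {
    awk -v name="$1" '$1 == "ID" && $NF == name { print $2; exit }'
}

# Dry run: print, do not execute, the steps for a dedicated system subvolume.
plan_root_subvol() {
    dev=$1
    mnt=$2
    name=$3
    printf 'mkfs.btrfs %s\n' "$dev"
    printf 'mount %s %s\n' "$dev" "$mnt"
    printf 'btrfs subvolume create %s/%s\n' "$mnt" "$name"
    # the id has to be looked up after creation, e.g. with subvol_id_from_list
    printf 'btrfs subvolume set-default <id of %s> %s\n' "$name" "$mnt"
}

plan_root_subvol /dev/sda /mnt rootvol
```

On a real system the id would come from something like
`btrfs subvolume list /mnt | subvol_id_from_list rootvol`.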
...
...
* I've read a bit more about btrfs and I think an implementation
like this will suffice for most users:
 - allow creation of a btrfs on top of 1-n blockdevices (user can
   pick raid levels for data and metadata)
 - allow creation of 0-m subvolumes
 - each subvolume as well as the default can get an arbitrary
   mountpoint, as well as specific mount options like
   compress, ssd, etc.  if i understood correctly, that is.
yup, i think that's everything for now!  ssd should enable
automatically when btrfs detects non rotating media.  and ssd_spread
is for cheaper flash i believe... i forget what the reason was.
for compress, we should be sure to note the CPU overhead of zlib (LZO
patches will be in the next kernel i believe, exciting), though for
many systems it may not matter.
okay, but as per your advice in the previous paragraph, users won't be
able to select a mountpoint for the btrfs itself, only the subvolumes.
(or maybe I'll just "discourage" them with a warning message)
...
...
 However, to be fully compatible with your hook, I will probably
 "strongly recommend" to create a subvolume __active and mount that
 as / Right? anything I missed?
in the newer setup __active isn't used anymore; i don't intend to
develop on that configuration anymore, and will phase anyone out in
favor of this upcoming release.  the new structure looks like this:
---------------------------------------------------------------------------------
/var/lib/btrfsadm
|-- boot
|   |-- extlinux.conf
|   `-- vesamenu.c32
|-- HEAD -> refs/rw/PRI
|-- pool
|   |-- FREE -> /dev/disk/by-label/btrfs-pool-free
|   `-- SELF -> /dev/disk/by-label/btrfs-pool-self
|-- refs
|   |-- ro
|   |   |-- log
|   |   |   |-- 1291021356 -> ../../../vols/260
|   |   |   |-- 1291056164 -> ../../../vols/261
|   |   |   `-- 1291102035 -> ../../../vols/262
|   |   `-- usr
|   |       `-- ORIG -> ../../../vols/260
|   `-- rw
|       |-- PRI -> ../../vols/262
|       |-- SEC -> ../../vols/261
|       `-- usr
`-- vols
    |-- 260
    |   |-- boot
    |   |   |-- kernel26-fallback.img
    |   |   |-- kernel26.img
    |   |   |-- System.map26
    |   |   `-- vmlinuz26
    |   `-- fs (THIS IS A SUBVOL)
    |-- 261
    |   |-- boot
    |   |   |-- kernel26-fallback.img
    |   |   |-- kernel26.img
    |   |   |-- System.map26
    |   |   `-- vmlinuz26
    |   `-- fs (THIS IS A SUBVOL)
    `-- 262
        |-- boot
        |   |-- kernel26-fallback.img
        |   |-- kernel26.img
        |   |-- kxloader.img
        |   |-- System.map26
        |   `-- vmlinuz26
        `-- fs (THIS IS THE ACTIVE SYSTEM ROOT)
---------------------------------------------------------------------------------
so... while much more involved, it's still very simple and 1000x
more flexible.  heavily inspired by the .git directory setup.
a quick breakdown:
/boot
this is the real boot device; can be a separate partition/disk,
multiple disks, or on the same btrfs FS (currently extlinux only).
also used for a 2-stage boot -- a kernel based "bootramfs" bootloader
is used to mount, find, and kexec the real kernel within a snapshot,
since standard bootloaders can't see inside subvols yet.
/HEAD
a symlink to a symlink.  HEAD points to the active ref (or directly to
a subvol, the git equivalent of a "detached head"), which points to a
particular subvol.  at any given time, when the system is running, HEAD
will _always_ point to the current subvol in use.
/pool
symlinks to ourself (SELF -- the active btrfs pool), and any others
(FREE will be used in the future if available to "steal" devices; this
will enable hot spares and automatic array repair)
/refs
a hierarchy of symlinks into the /vols directory.  for every subvol
the user has, a symlink in here will exist.  there will also be some
system managed ones (such as "log"... which is autosnap on reboot, if
enabled).  ORIG=snapshot after install, PRI=primary system root,
SEC=the previous system root.  user can manage these with the upcoming
btrfsadm tool.
/vols
all the actual subvols.  named by id.  the above `tree` shows a
"detached boot" state... where boot is outside the fs.  this setup
enables extlinux (and others potentially) to perform kernel level
rollbacks without the use of a 2-stage boot process, but requires
/boot (from within the system) to be a symlink:
# mount
...
/dev/sda on /var/lib/btrfsadm type btrfs (rw,noatime,subvolid=0)
...
# ls -l /boot
lrwxrwxrwx 1 root root 26 Nov 29 03:11 /boot -> var/lib/btrfsadm/HEAD/boot
this way, mkinitcpio and friends work, and copy the kernel to the
proper detached boot by dereferencing HEAD.  also, since extlinux can
follow symlinks, simply pointing to HEAD or other refs in
extlinux.conf works (must be under 255 chars). ultimately this is a
workaround for bootloaders unable to handle btrfs or btrfs subvols,
but it works very well, and is easy to move to an "inclusive boot"
later on when bootloader support is better.
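
The "symlink to a symlink" chain can be tried out without btrfs at all. The
sketch below rebuilds a small part of the tree in a temporary directory,
with plain directories standing in for the subvols, and resolves HEAD. The
function name demo_head is made up. The link targets are copied from the
tree listing:

```sh
#!/bin/sh
# Throwaway imitation of part of the /var/lib/btrfsadm layout, for illustration only.
demo_head() {
    root=$(mktemp -d) || return 1
    mkdir -p "$root/refs/rw" "$root/vols/261/fs" \
             "$root/vols/262/fs" "$root/vols/262/boot"
    ln -s ../../vols/262 "$root/refs/rw/PRI"    # PRI -> ../../vols/262
    ln -s ../../vols/261 "$root/refs/rw/SEC"    # SEC -> ../../vols/261
    ln -s refs/rw/PRI "$root/HEAD"              # HEAD -> refs/rw/PRI
    # follow HEAD -> PRI -> vols/262 and print the final directory name
    (cd "$root/HEAD" && basename "$(pwd -P)")
    rm -rf "$root"
}

demo_head
```

A path such as HEAD/boot is resolved through the same chain, which is what
the /boot symlink shown above relies on.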
o_O O_o

so, to paraphrase: in your hook, you build this kind of tree structure
based on the btrfs devices you find (/pool) and subvolumes (/vols), and
create some symlinks to organize everything (/refs, /HEAD); the idea
being that this will make things more simple during the hook processing.
is this structure in memory only during execution of your hook, or does
it all get written to disk (the btrfs root?) so that the real booted
system will see it also?

right?
...
---------------------------------------------------------------------------------
i know that's a lot of information, and probably more than needed, but
i've been meaning to write it down anyway :-)
let me know how you think that could jive with AIF.
Well, the /var/lib/btrfsadm tree you described seems fairly
non-standard, but you seem to know what you're doing.
If I understood correctly I don't need to worry about
the /var/lib/btrfsadm tree, right?
So you can do your thing and I'll do mine, making sure to strongly
recommend users to put all btrfs mountpoints in separate subvolumes.
...
C Anthony
Dieter