[arch-releng] btrfs support in aif

C Anthony Risinger anthony at extof.me
Mon Dec 13 20:02:17 EST 2010


On Sun, Dec 12, 2010 at 9:29 AM, Dieter Plaetinck <dieter at plaetinck.be> wrote:
> Anthony and other interested folks,
> I've been looking a bit further, and it seems like btrfs support
> shouldn't be too hard to implement.  It actually seems simpler than LVM
> (because lvm has 3 levels: PV,VG and LV; btrfs has just the btrfs
> itself (~default subvolume) and other subvolumes)
> subvolumes don't get a new devicefile but i'll probably use something
> like:
> /dev/sda:$spec
> to denote what's what.
> if $spec is a number, it will become mount option subvolid=$spec;
> otherwise subvol=$spec
> although from what i can tell, the id's aren't used often. and it
> seems more robust to me to use names anyway.

yeah, that would work fine; should be simpler than LVM.  the problem
with mounting by name is that it only works when "name" is in the
btrfs root (the real root, subvolid=0). ie:

/subvol

... works, but not ...

/nested/subvol

the hook i'm soon to release doesn't support names; it's just too
inflexible.  btw, for clarity to anyone else, the default subvol is
not the same as the btrfs root (though initially they are the same).
default subvol is any subvol marked as the _mount_ default (and later
mountable via `subvol=.` or none at all)... the real root will always
be subvolid = 0 or 5.
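
for reference, a rough sketch of the `/dev/sda:$spec` parsing proposed
above, in plain sh.  `spec_to_mount_opt` is a made-up name (not part of
aif), and the name branch still has the limitation just described: it
only helps for subvols sitting directly in the btrfs root.

```shell
# hypothetical helper, not part of aif: split "device:spec" and print
# the device followed by the matching btrfs mount option
spec_to_mount_opt() {
    case $1 in
        *:*) ;;
        *) echo "no spec in '$1'" >&2; return 1 ;;
    esac
    dev=${1%%:*}     # everything before the first colon
    spec=${1#*:}     # everything after the first colon
    case $spec in
        '')        echo "empty spec in '$1'" >&2; return 1 ;;
        *[!0-9]*)  opt="subvol=$spec" ;;      # anything non-numeric is a name
        *)         opt="subvolid=$spec" ;;    # digits only: treat as an id
    esac
    echo "$dev $opt"
}
```

so `spec_to_mount_opt /dev/sda:256` gives a subvolid= option and
`spec_to_mount_opt /dev/sda:home` gives a subvol= option.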

> * which are the requirements your btrfs_advanced mkinitcpio hook
>  implies?  what things does aif need to do other than just doing
>  mkfs.btrfs to get the full potential out of btrfs/your hook?
>  please explain why a default btrfs configuration does not suffice.
>  does it have something to do with
> https://btrfs.wiki.kernel.org/index.php/UseCases#Can_a_snapshot_be_replaced_atomically_with_another_snapshot.3F ?

it's sort of related to that i think.  the reeeeeaaaaalllly messy part
is what to do when a user has installed the system into the btrfs
root, instead of a dedicated subvol.  the issue is the btrfs root is
not movable/editable/replaceable; all other subvols can be
moved/renamed/deleted/etc... except the root.  thus, there is no clean
way to programmatically "move" the system (in preparation for
rollback/manage snapshots/etc.).  everything in / must be rm -rf'ed
manually or it will ultimately become dead space.  i've brought this
up probably 5 different times to the list but never get any response
:-(

the hook (and other impls i'd assume) use the btrfs root for volume
management, the "sub-root".  the actual "system root" is just one of
many subvols in the pool, and may change between reboots.  at the very
least, if AIF created a subvol, marked as default, and installed into
that subvol, my hook could then safely "rotate" the user into a more
advanced configuration... i just need the system in a subvol.  the
only difference the user sees with this procedure (dedicated subvol by
default) is a "mysterious" directory when they run "btrfs subvolume
list" that doesn't seem to exist :-) because it's actually underneath
their /.

but really, under no circumstances do i think the system should be
installed into the btrfs root; i wouldn't even offer it at install
time.  if a user wants that they can do it themselves... they will be
happy it's in a subvol.
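
a minimal sketch of the "create a subvol, mark it default, install into
it" sequence, assuming the btrfs-progs `btrfs subvolume` commands.  the
function name, the `run` wrapper and the DRY_RUN switch are
illustration only, not anything aif or the hook ships.  by default it
just prints the commands.

```shell
# sketch only.  DRY_RUN=1 (the default here) prints instead of executing.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = block device, $2 = temporary mountpoint, $3 = subvol name (all hypothetical)
setup_root_subvol() {
    dev=$1 mnt=$2 name=$3
    # mount the real btrfs root so the new subvol lands directly inside it
    run mount -o subvolid=0 "$dev" "$mnt" || return 1
    run btrfs subvolume create "$mnt/$name" || return 1
    if [ "$DRY_RUN" = 1 ]; then
        id='<id>'    # placeholder, the real id is only known after creation
    else
        # `btrfs subvolume list` prints "ID <n> ... path <name>" per subvol
        id=$(btrfs subvolume list "$mnt" | awk -v n="$name" '$NF == n { print $2 }')
        [ -n "$id" ] || return 1
    fi
    run btrfs subvolume set-default "$id" "$mnt" || return 1
    run umount "$mnt" || return 1
    # with no subvol= option, this mount now lands in the new default subvol
    run mount "$dev" "$mnt"
}
```

e.g. `setup_root_subvol /dev/sda /mnt rootvol` prints the five commands
in order; the installer would then install into the final mount.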

> * I've read a bit more about btrfs and I think an implementation like
>  this will suffice for most users:
>  - allow creation of a btrfs on top of 1-n blockdevices (user can
>    pick raid levels for data and metadata)
>  - allow creation of 0-m subvolumes
>  - each subvolume as well as the default can get an arbitrary
>    mountpoint, as well as specific mount options like
>    compress, ssd, etc.  if i understood correctly, that is.

yup, i think that's everything for now!  ssd should enable
automatically when btrfs detects non rotating media.  and ssd_spread
is for cheaper flash i believe... i forget what the reason was.
for compress, we should be sure to note the CPU overhead of zlib,
though for many systems it may not matter (LZO patches will be in the
next kernel i believe, exciting).
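
if aif wants to show the user what btrfs is likely to pick, one option
is to read the kernel's rotational flag from sysfs.  this is a sketch
under the assumption that the ssd autodetection keys off that flag
(that detail isn't stated above); `disk_kind` and its second argument
are made up for illustration.

```shell
# hypothetical helper: $1 = disk name (e.g. sda), $2 = sysfs root (default /sys)
disk_kind() {
    flag="${2:-/sys}/block/$1/queue/rotational"
    [ -r "$flag" ] || { echo unknown; return 1; }
    case $(cat "$flag") in
        0) echo non-rotating ;;   # flash and friends
        *) echo rotating ;;       # spinning disk
    esac
}
```

`disk_kind sda` reports on the real device; the optional sysfs root is
only there so the function can be tried against a fake tree.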

>  However, to be fully compatible with your hook, I will probably
>  "strongly recommend" to create a subvolume __active and mount that
>  as / Right? anything I missed?

in the newer setup __active isn't used anymore; i don't intend to
develop on that configuration anymore, and will migrate anyone still
on it to this upcoming release.  the new structure looks like this:

---------------------------------------------------------------------------------

/var/lib/btrfsadm
|-- boot
|   |-- extlinux.conf
|   `-- vesamenu.c32
|-- HEAD -> refs/rw/PRI
|-- pool
|   |-- FREE -> /dev/disk/by-label/btrfs-pool-free
|   `-- SELF -> /dev/disk/by-label/btrfs-pool-self
|-- refs
|   |-- ro
|   |   |-- log
|   |   |   |-- 1291021356 -> ../../../vols/260
|   |   |   |-- 1291056164 -> ../../../vols/261
|   |   |   `-- 1291102035 -> ../../../vols/262
|   |   `-- usr
|   |       `-- ORIG -> ../../../vols/260
|   `-- rw
|       |-- PRI -> ../../vols/262
|       |-- SEC -> ../../vols/261
|       `-- usr
`-- vols
    |-- 260
    |   |-- boot
    |   |   |-- kernel26-fallback.img
    |   |   |-- kernel26.img
    |   |   |-- System.map26
    |   |   `-- vmlinuz26
    |   `-- fs (THIS IS A SUBVOL)
    |-- 261
    |   |-- boot
    |   |   |-- kernel26-fallback.img
    |   |   |-- kernel26.img
    |   |   |-- System.map26
    |   |   `-- vmlinuz26
    |   `-- fs (THIS IS A SUBVOL)
    `-- 262
        |-- boot
        |   |-- kernel26-fallback.img
        |   |-- kernel26.img
        |   |-- kxloader.img
        |   |-- System.map26
        |   `-- vmlinuz26
        `-- fs (THIS IS THE ACTIVE SYSTEM ROOT)

---------------------------------------------------------------------------------

so... while much more involved, it's still very simple and 1000x
more flexible.  heavily inspired by the .git directory setup.

a quick breakdown:

/boot
this is the real boot device; can be a separate partition/disk,
multiple disks, or on the same btrfs FS (currently extlinux only).
also used for a 2-stage boot -- a kernel based "bootramfs" bootloader
is used to mount, find, and kexec the real kernel within a snapshot,
since standard bootloaders can't see inside subvols yet.

/HEAD
a symlink to a symlink.  HEAD points to the active ref (or directly to
a subvol, the git equivalent of a "detached head"), which points to a
particular subvol.  at any given time, when the system is running, HEAD
will _always_ point to the current subvol in use.
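
the symlink-to-symlink chain can be tried out in a scratch directory.
this is only a mock: plain directories stand in for subvols, and the
function names are invented here, not part of the hook or btrfsadm.

```shell
# build a throwaway copy of the symlink layout (plain dirs instead of subvols)
make_demo_layout() {
    root=$1
    mkdir -p "$root/vols/261/fs" "$root/vols/262/fs" "$root/refs/rw"
    ln -s ../../vols/262 "$root/refs/rw/PRI"
    ln -s ../../vols/261 "$root/refs/rw/SEC"
    ln -s refs/rw/PRI "$root/HEAD"
}

# print the ref HEAD points at, then the vol that ref finally resolves to
show_head() {
    root=$1
    echo "ref: $(readlink "$root/HEAD")"                   # one hop only
    echo "vol: $(basename "$(readlink -f "$root/HEAD")")"  # follow the whole chain
}

demo=$(mktemp -d)
make_demo_layout "$demo"
show_head "$demo"
rm -rf "$demo"
```

with the layout above this prints `ref: refs/rw/PRI` and `vol: 262`.
a "detached head" would simply be HEAD pointing straight at vols/<id>.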

/pool
symlinks to ourself (SELF -- the active btrfs pool), and any others
(FREE will be used in the future if available to "steal" devices; this
will enable hot spares and automatic array repair)

/refs
a hierarchy of symlinks into the /vols directory.  for every subvol
the user has, a symlink in here will exist.  there will also be some
system managed ones (such as "log"... which is autosnap on reboot, if
enabled).  ORIG=snapshot after install, PRI=primary system root,
SEC=the previous system root.  user can manage these with the upcoming
btrfsadm tool.
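
a sketch of what rotating PRI/SEC could look like at the symlink
level.  `rotate_refs` is a made-up name and not how btrfsadm is known
to do it; it assumes the refs/rw and vols layout shown in the tree
above and GNU-style `ln -sfn`.

```shell
# hypothetical helper: $1 = layout root, $2 = id of the new system root in vols/
rotate_refs() {
    root=$1 new=$2
    [ -d "$root/vols/$new" ] || { echo "no such vol: $new" >&2; return 1; }
    # remember where PRI points before it gets overwritten
    old=$(readlink "$root/refs/rw/PRI") || return 1
    ln -sfn "$old" "$root/refs/rw/SEC"              # previous root becomes SEC
    ln -sfn "../../vols/$new" "$root/refs/rw/PRI"   # new root becomes PRI
}
```

since HEAD points at refs/rw/PRI rather than at a vol, it follows the
new PRI without being touched.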

/vols
all the actual subvols.  named by id.  the above `tree` shows a
"detached boot" state... where boot is outside the fs.  this setup
enables extlinux (and others potentially) to perform kernel level
rollbacks without the use of a 2-stage boot process, but requires
/boot (from within the system) to be a symlink:

# mount
...
/dev/sda on /var/lib/btrfsadm type btrfs (rw,noatime,subvolid=0)
...

# ls -l /boot
lrwxrwxrwx 1 root root 26 Nov 29 03:11 /boot -> var/lib/btrfsadm/HEAD/boot

this way, mkinitcpio and friends work, and copy the kernel to the
proper detached boot by dereferencing HEAD.  also, since extlinux can
follow symlinks, simply pointing to HEAD or other refs in
extlinux.conf works (must be under 255 chars). ultimately this is a
workaround for bootloaders unable to handle btrfs or btrfs subvols,
but it works very well, and is easy to move to an "inclusive boot"
later on when bootloader support is better.

---------------------------------------------------------------------------------

i know that's a lot of information, and probably more than needed, but
i've been meaning to write it down anyway :-)

let me know how you think that could jibe with AIF.

C Anthony

