[arch-releng] btrfs support in aif

Tue Dec 14 04:18:13 EST 2010

On Mon, 13 Dec 2010 19:02:17 -0600
C Anthony Risinger <anthony at extof.me> wrote:

> On Sun, Dec 12, 2010 at 9:29 AM, Dieter Plaetinck
> <dieter at plaetinck.be> wrote:
> > Anthony and other interested folks,
> > I've been looking a bit further, and it seems like btrfs support
> > shouldn't be too hard to implement.  It actually seems simpler than
> > LVM (because lvm has 3 levels: PV,VG and LV; btrfs has just the
> > btrfs itself (~default subvolume) and other subvolumes)
> > subvolumes don't get a new devicefile but i'll probably use
> > something like:
> > /dev/sda:$spec
> > to denote what's what.
> > if $spec is a number, it will become mount option subvolid=$spec;
> > otherwise subvol=$spec
> > although from what i can tell, the id's aren't used often. and it
> > seems more robust to me to use names anyway.
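
A minimal sketch of the /dev/sda:$spec mapping described above. The
function name spec_to_mount_opt is hypothetical and not part of aif; it
only illustrates the "number means subvolid, anything else means subvol"
rule.

```shell
#!/bin/sh
# Hypothetical helper (not part of aif): take "device:spec" and print
# the btrfs mount option that the spec would translate to.
spec_to_mount_opt() {
    # Require a ':' followed by at least one character.
    case $1 in
        *:?*) ;;
        *) return 1 ;;
    esac
    spec=${1#*:}                # everything after the first ':'
    case $spec in
        *[!0-9]*) printf 'subvol=%s\n' "$spec" ;;     # contains a non-digit: a name
        *)        printf 'subvolid=%s\n' "$spec" ;;   # digits only: an id
    esac
}

spec_to_mount_opt /dev/sda:256     # prints subvolid=256
spec_to_mount_opt /dev/sda:home    # prints subvol=home
```

One consequence of this rule: a subvolume whose name consists only of
digits could not be addressed by name.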
> 
> yeah, that would work fine; should be simpler than LVM.  the problem
> with mounting by name is that it only works when "name" is in the
> btrfs root (the real root, subvolid=0). ie:
> 
> /subvol
> 
> ... works, but not ...
> 
> /nested/subvol

is this the only limitation with named subvols?
i'm probably missing something, but since the real btrfs is basically
used as a container which will contain subvolumes which are the
ones you will mount in your filesystem (in arbitrary places), why does
it matter where in the btrfs tree the subvolumes are defined? what's
wrong with putting subvolumes in the btrfs root?

it's too bad separate subvolumes don't get their own devicefiles (and
hence, associated /dev/disk/by-uuid/ and /dev/disk/by-label/ symlinks);
that would make my life easier.
are the id's "stable"? i.e, suppose i initialise variable
last_id=0
for every subvolume the user wants to create, can i just do
id=last_id+1 and assume that id will stay the same for this volume?
(at least for the duration of the installation, i.e. not worrying about
snapshot rollbacks etc) (and apparently, skip id 5 because that's
already taken)
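
An alternative to predicting ids with a counter would be to read the id
back after each subvolume is created. The sketch below assumes that
"btrfs subvolume list" prints lines of the form "ID <n> ... path <name>";
the helper name subvol_id_for_path is made up for illustration. It reads
the listing on stdin so it can be tried without a btrfs at hand.

```shell
#!/bin/sh
# Hypothetical helper: read `btrfs subvolume list` output on stdin and
# print the id of the subvolume whose path equals $1.
# Exits non-zero when no such subvolume is listed.
subvol_id_for_path() {
    awk -v want="$1" '
        $1 == "ID" {
            # The path is whatever follows the " path " keyword.
            i = index($0, " path ")
            if (i && substr($0, i + 6) == want) { print $2; found = 1; exit }
        }
        END { exit !found }
    '
}

# Sample listing in the assumed format.
sample='ID 256 top level 5 path root
ID 257 top level 5 path home'

printf '%s\n' "$sample" | subvol_id_for_path home    # prints 257
```

In an installer this would be fed from the real command, e.g.
btrfs subvolume list /mnt | subvol_id_for_path home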

> the hook i'm soon to release doesn't support names; it's just too
> inflexible.  btw, for clarity to anyone else, the default subvol is
> not the same as the btrfs root (though initially they are the same).
> default subvol is any subvol marked as the _mount_ default (and later
> mountable via `subvol=.` or none at all)... the real root will always
> be subvolid = 0 or 5.

subvolid 5 ?

> > * which are the requirements your btrfs_advanced mkinitcpio hook
> >  implies?  what things does aif need to do other than just doing
> >  mkfs.btrfs to get the full potential out of btrfs/your hook?
> >  please explain why a default btrfs configuration does not suffice.
> >  does it have something to do with
> > https://btrfs.wiki.kernel.org/index.php/UseCases#Can_a_snapshot_be_replaced_atomically_with_another_snapshot.3F ?
> 
> it's sort of related to that i think.  the reeeeeaaaaalllly messy part
> is what to do when a user has installed the system into the btrfs
> root, instead of a dedicated subvol.  the issue is the btrfs root is
> not movable/editable/replaceable; all other subvols can be
> moved/renamed/deleted/etc... except the root.  thus, there is no clean
> way to programmatically "move" the system (in preparation for
> rollback/manage snapshots/etc.).  everything in / must be rm -rf'ed
> manually or it will ultimately become dead space.  i've brought this
> up probably 5 different times to the list but never get any response
> :-(
> 
> the hook (and other impls i'd assume) use the btrfs root for volume
> management, the "sub-root".  the actual "system root" is just one of
> many subvols in the pool, and may change between reboots.  at the very
> least, if AIF created a subvol, marked as default, and installed into
> that subvol, my hook could then safely "rotate" the user into a more
> advanced configuration...

should i give it a specific name? or just a subvol marked as default?
what kind of advanced config do you mean? any stuff that makes more
sense to be done during the installation step? or does it become too
specific to your hook?

> i just need the system in a subvol.  the
> only difference user sees by this procedure (dedicated subvol by
> default) is a "mysterious" directory when they run "btrfs subvolume
> list" that doesn't seem to exist :-) because it's actually underneath
> their /.
> 
> but really, under no cases do i think the system should be installed
> into the btrfs root, i wouldn't even offer it at install time.  if user
> wants that they can do it themselves... they will be happy it's in a
> subvol.

okay, fair enough. i will make it so that you can't choose a mountpoint
for the actual btrfs, only for subvolumes.  if the btrfs guys ever make
things more flexible, it's fairly trivial for us to adapt aif/your hook.
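
A rough sketch of the sequence Anthony asks for above (create a subvol,
mark it as default, install into it). The device, mountpoint and the
subvolume name "sysroot" are placeholders, and the run/install_into_subvol
helpers are hypothetical. DRY_RUN defaults to 1 so the script only prints
what it would do.

```shell
#!/bin/sh
# Sketch only: device, mountpoint and subvolume name are placeholders.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
: "${DRY_RUN:=1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

install_into_subvol() {
    dev=$1 mnt=$2 name=$3
    run mount -o subvolid=0 "$dev" "$mnt"        # mount the real btrfs root
    run btrfs subvolume create "$mnt/$name"      # dedicated system subvolume
    if [ "$DRY_RUN" = 1 ]; then
        id='<id>'                                # unknown until the subvol exists
    else
        # assumes list lines look like "ID <n> ... path <name>"
        id=$(btrfs subvolume list "$mnt" | awk -v p="$name" '$NF == p { print $2 }')
    fi
    run btrfs subvolume set-default "$id" "$mnt" # make it the mount default
    run umount "$mnt"
    run mount "$dev" "$mnt"                      # now mounts the new default subvol
}

install_into_subvol /dev/sda1 /mnt sysroot
```

After the final mount, the package installation would proceed into $mnt
as usual; the btrfs root itself stays empty apart from the subvolume.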

> 
> > * I've read a bit more about btrfs and I think an implementation
> > like this will suffice for most users:
> >  - allow creation of a btrfs on top of 1-n blockdevices (user can
> >    pick raid levels for data and metadata)
> >  - allow creation of 0-m subvolumes
> >  - each subvolume as well as the default can get an arbitrary
> >    mountpoint, as well as specific mount options like
> >    compress, ssd, etc.  if i understood correctly, that is.
> 
> yup, i think that's everything for now!  ssd should enable
> automatically when btrfs detects non rotating media.  and ssd_spread
> is for cheaper flash i believe... i forget what the reason was.
> compress we should be sure to note the CPU overhead of zlib (though
> LZO patches will be in next kernel i believe, exciting), though for
> many systems it may not matter.

okay, but as per your advice in the previous paragraph, users won't be
able to select a mountpoint for the btrfs itself, only the subvolumes.
(or maybe I'll just "discourage" them with a warning message)
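
For illustration, the per-subvolume mountpoints and options could end up
as fstab lines along these lines. fstab_line is a hypothetical helper,
and the device and subvolume names are examples only.

```shell
#!/bin/sh
# Hypothetical helper: print one fstab line for a btrfs subvolume.
# usage: fstab_line DEVICE SUBVOL MOUNTPOINT [EXTRA_OPTS]
fstab_line() {
    opts="subvol=$2"
    # Append user-selected options (compress, ssd, ...) if any were given.
    if [ -n "${4:-}" ]; then
        opts="$opts,$4"
    fi
    printf '%s %s btrfs %s 0 0\n' "$1" "$3" "$opts"
}

fstab_line /dev/sda root / compress
fstab_line /dev/sda home /home
```

Every line carries a subvol= option, so the btrfs itself never gets a
mountpoint of its own.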

> >  However, to be fully compatible with your hook, I will probably
> >  "strongly recommend" to create a subvolume __active and mount that
> >  as / Right? anything I missed?
> 
> in the newer setup __active isn't used anymore; i don't intend to
> develop on that configuration anymore, and will phase anyone out in
> favor of this upcoming release.  the new structure looks like this:
> 
> ---------------------------------------------------------------------------------
> 
> /var/lib/btrfsadm
> |-- boot
> |   |-- extlinux.conf
> |   `-- vesamenu.c32
> |-- HEAD -> refs/rw/PRI
> |-- pool
> |   |-- FREE -> /dev/disk/by-label/btrfs-pool-free
> |   `-- SELF -> /dev/disk/by-label/btrfs-pool-self
> |-- refs
> |   |-- ro
> |   |   |-- log
> |   |   |   |-- 1291021356 -> ../../../vols/260
> |   |   |   |-- 1291056164 -> ../../../vols/261
> |   |   |   `-- 1291102035 -> ../../../vols/262
> |   |   `-- usr
> |   |       `-- ORIG -> ../../../vols/260
> |   `-- rw
> |       |-- PRI -> ../../vols/262
> |       |-- SEC -> ../../vols/261
> |       `-- usr
> `-- vols
>     |-- 260
>     |   |-- boot
>     |   |   |-- kernel26-fallback.img
>     |   |   |-- kernel26.img
>     |   |   |-- System.map26
>     |   |   `-- vmlinuz26
>     |   `-- fs (THIS IS A SUBVOL)
>     |-- 261
>     |   |-- boot
>     |   |   |-- kernel26-fallback.img
>     |   |   |-- kernel26.img
>     |   |   |-- System.map26
>     |   |   `-- vmlinuz26
>     |   `-- fs (THIS IS A SUBVOL)
>     `-- 262
>         |-- boot
>         |   |-- kernel26-fallback.img
>         |   |-- kernel26.img
>         |   |-- kxloader.img
>         |   |-- System.map26
>         |   `-- vmlinuz26
>         `-- fs (THIS IS THE ACTIVE SYSTEM ROOT)
> 
> ---------------------------------------------------------------------------------
> 
> so... while much more involved, it's still very simple and 1000x
> more flexible.  heavily inspired by the .git directory setup.
> 
> a quick breakdown:
> 
> /boot
> this is the real boot device; can be a separate partition/disk,
> multiple disks, or on the same btrfs FS (currently extlinux only).
> also used for a 2-stage boot -- a kernel based "bootramfs" bootloader
> is used to mount, find, and kexec the real kernel within a snapshot,
> since standard bootloaders can't see inside subvols yet.
> 
> /HEAD
> a symlink to a symlink.  HEAD points to the active ref (or directly to
> a subvol, the git equivalent of a "detached head"), which points to a
> particular subvol.  at any given time, when the system is running, HEAD
> will _always_ point to the current subvol in use.
> 
> /pool
> symlinks to ourself (SELF -- the active btrfs pool), and any others
> (FREE will be used in the future if available to "steal" devices; this
> will enable hot spares and automatic array repair)
> 
> /refs
> a hierarchy of symlinks into the /vols directory.  for every subvol
> the user has, a symlink in here will exist.  there will also be some
> system managed ones (such as "log"... which is autosnap on reboot, if
> enabled).  ORIG=snapshot after install, PRI=primary system root,
> SEC=the previous system root.  user can manage these with the upcoming
> btrfsadm tool.
> 
> /vols
> all the actual subvols.  named by id.  the above `tree` shows a
> "detached boot" state... where boot is outside the fs.  this setup
> enables extlinux (and others potentially) to perform kernel level
> rollbacks without the use of a 2-stage boot process, but requires
> /boot (from within the system) to be a symlink:
> 
> # mount
> ...
> /dev/sda on /var/lib/btrfsadm type btrfs (rw,noatime,subvolid=0)
> ...
> 
> # ls -l /boot
> lrwxrwxrwx 1 root root 26 Nov 29 03:11 /boot -> var/lib/btrfsadm/HEAD/boot
> 
> this way, mkinitcpio and friends work, and copy the kernel to the
> proper detached boot by dereferencing  HEAD.  also, since extlinux can
> follow symlinks, simply pointing to HEAD or other refs in
> extlinux.conf works (must be under 255 chars). ultimately this is a
> workaround for bootloaders unable to handle btrfs or btrfs subvols,
> but it works very well, and is easy to move to an "inclusive boot"
> later on when bootloader support is better.

o_O O_o

so, to paraphrase: in your hook, you build this kind of tree structure
based on the btrfs devices you find (/pool) and subvolumes (/vols), and
create some symlinks to organize everything (/refs, /HEAD); the idea
being that this will make things simpler during the hook processing.
is this structure in memory only during execution of your hook, or does
it all get written to disk (the btrfs root?) so that the real booted
system will see it also?

right?
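
To make the "symlink to a symlink" part concrete, here is a small model
of the HEAD -> ref -> vol chain in a scratch directory. It needs no btrfs
and is only an illustration of the quoted layout, not Anthony's actual
hook; active_vol is a made-up helper, and repointing HEAD by hand merely
mimics what a rollback might look like.

```shell
#!/bin/sh
# Scratch-directory model of the HEAD -> ref -> vol chain; no btrfs needed.
demo=$(mktemp -d)
mkdir -p "$demo/vols/261" "$demo/vols/262" "$demo/refs/rw"
ln -s ../../vols/262 "$demo/refs/rw/PRI"    # primary system root
ln -s ../../vols/261 "$demo/refs/rw/SEC"    # previous system root
ln -s refs/rw/PRI "$demo/HEAD"              # HEAD points at a ref, not a vol

# Follow the whole symlink chain and print the name of the vol it ends at.
active_vol() { basename "$(readlink -f "$1/HEAD")"; }

active_vol "$demo"     # prints 262

# Mimic a rollback: repoint HEAD at the previous root.
ln -sfn refs/rw/SEC "$demo/HEAD"
active_vol "$demo"     # prints 261
```

Anything that goes through HEAD (such as the /boot symlink in the quoted
text) follows the switch without being touched itself.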

> ---------------------------------------------------------------------------------
> 
> i know that's a lot of information, and probably more than needed, but
> i've been meaning to write it down anyway :-)
> 
> let me know how you think that could jive with AIF.

Well, the /var/lib/btrfsadm tree you described seems fairly
non-standard, but you seem to know what you're doing.
If I understood correctly I don't need to worry about
the /var/lib/btrfsadm tree, right?
So you can do your thing and I'll do mine, making sure to strongly
recommend users to put all btrfs mountpoints in separate subvolumes.

> C Anthony

Dieter