[pacman-dev] Repo database(s) layout

Tue Nov 5 05:58:07 EST 2013

On 05/11/13 13:58, Dan McGee wrote:
> On Mon, Nov 4, 2013 at 7:23 PM, Allan McRae <allan at archlinux.org> wrote:
> 
>> Hi,
>>
>> We currently have .db and .files databases, with .files being a
>> superset of .db.
>>
>> An idea was formed on IRC to completely separate these.   I.e. .db stays
>> as it is and .files only includes the file lists.   We would then add
>> .source to include the source package information.   I would set
>> repo-add to automatically create all these files.
>>
>> We would then add something like "-S --refresh-files" and "-S
>> --refresh-source" to download those files as a one off, printing a
>> warning when using them if they are out of date compared to the repo.
>> Another option is to use Usage as a flag for when to download them, but
>> refreshing all those every update seems excessive.
>>
>> This would also allow us to have some basic pkgfile functionality in
>> pacman (-So).
>>
>> So, there is much to work out, but does the general idea sound good to people?
>>
>
> No, this sounds like a step backwards to me, so -1 (multiplied by as many
> times as I'm allowed to vote -1).
> 
> For a while, repo-add didn't know how to create .files databases. This was
> added in January 2011:
> https://projects.archlinux.org/pacman.git/commit/scripts/repo-add.sh.in?id=eda4d9ec00be1108ab4336a438299a283c5a0a90
> 
> That allowed us to commit a large change to the way dbscripts generated
> these package files (which was error-prone, slow, and they were not
> immediately up-to-date like they are now):
> https://projects.archlinux.org/dbscripts.git/commit/?id=fc6a6ab07bde03c7f20d5a4ed971f8e699ee9b20
> 
> Why did I start down this road? Because it was absolutely impossible to get
> consistent, "transactional", database data in any way shape or form that
> didn't require 82 special cases in Archweb to handle parsing and loading
> the data into a database. Once I open a .files database file, I know I
> don't need anything else to have a consistent view of that database. As
> soon as we had to pry into two different files, things were an absolute
> mess, and one has to cross-reference two different files, guess and pray
> that the architectures are actually correct on the files data (because
> there is no way to tell if you don't have the other data, keep this in
> mind), and have no real way of telling which database file lags the other.
> 
> I'm not sure what the rationale is for removing the non-files data from the
> files databases. Does it make them notably larger or slower to process?
> 

The non-files data makes up ~5% of the files database.
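
For anyone wanting to check that figure against their own repo, here is a
rough sketch of one way to measure it.  It assumes a .files database is a
tar archive with one directory per package, where the file list lives in a
member named "files" and everything else counts as non-files data.  The
helper names (non_files_share, make_sample_db) are made up for this
example and are not part of pacman or repo-add.

```python
import io
import tarfile


def non_files_share(db_fileobj):
    """Fraction of uncompressed member bytes NOT stored in 'files' members.

    Assumes entries are laid out as '<pkg>/<member>' (an assumption of
    this sketch), with the file list in a member called 'files'.
    """
    total = 0
    files_bytes = 0
    # "r:*" lets tarfile detect the compression of a seekable stream.
    with tarfile.open(fileobj=db_fileobj, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            total += member.size
            if member.name.rsplit("/", 1)[-1] == "files":
                files_bytes += member.size
    return (total - files_bytes) / total if total else 0.0


def make_sample_db(desc_size, files_size):
    """Build a toy in-memory archive with one package, for illustration only."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, size in (("foo-1.0-1/desc", desc_size),
                           ("foo-1.0-1/files", files_size)):
            info = tarfile.TarInfo(name)
            info.size = size
            tar.addfile(info, io.BytesIO(b"x" * size))
    buf.seek(0)
    return buf
```

To try it on a real database, open the file in binary mode and pass the
file object to non_files_share().  Note that this counts uncompressed
sizes, so the share of the compressed download may differ.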

But I do not understand your argument against this.  My idea is to
have repo-add ALWAYS create both the .db and .files databases, instead of
having to run repo-add twice to generate the separate files.  In that case I
find it redundant to have the .db information within the .files database.
But I really want to implement repo-add generating/updating both the .db
and .files databases in a single call regardless of what information
stays in the .files database.
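
To make the single-call idea concrete, here is a minimal sketch in Python.
It is not how repo-add is implemented; the function names, the package
dict layout, and the keep_desc_in_files switch are all hypothetical.  It
assumes each package is an entry "<name>-<version>/" holding a "desc"
member and, in the files database, a "files" member.  Both archives are
written from the same records in one pass, so neither can lag the other,
and the switch covers both outcomes of the open question.

```python
import io
import tarfile


def _add(tar, name, text):
    """Add a text member to an open tar archive."""
    data = text.encode()
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def write_repo(packages, keep_desc_in_files=True):
    """Build .db and .files archives in one pass; returns (db_bytes, files_bytes).

    keep_desc_in_files=True keeps .files a superset of .db;
    False gives a .files database containing only the file lists.
    """
    db_buf, files_buf = io.BytesIO(), io.BytesIO()
    with tarfile.open(fileobj=db_buf, mode="w:gz") as db, \
         tarfile.open(fileobj=files_buf, mode="w:gz") as files:
        for pkg in packages:
            entry = f"{pkg['name']}-{pkg['version']}"
            _add(db, f"{entry}/desc", pkg["desc"])
            if keep_desc_in_files:
                _add(files, f"{entry}/desc", pkg["desc"])
            file_list = "%FILES%\n" + "\n".join(pkg["files"]) + "\n"
            _add(files, f"{entry}/files", file_list)
    return db_buf.getvalue(), files_buf.getvalue()


def member_names(archive_bytes):
    """List the member names of an in-memory gzip-compressed tar archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        return sorted(tar.getnames())
```

With keep_desc_in_files=False a consumer needs both archives to get a
complete view of a package, which is the cross-referencing concern raised
above; with True the files database stays self-contained at the cost of
the duplicated data.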

I suppose this comes down to the following questions.  Where should the
source package information go?  The .db file?  At a rough guess, the PGP
signature for the source package would increase the repo database by an
extra 30-40%.  So perhaps a separate .source db?  If separate, what
information should go there?  And should there be a type of database
containing ALL information?

Allan