[aur-dev] [PATCH 3/3] Segment the upload directory by package name prefix

Lukas Fleischer archlinux at cryptocrack.de
Fri Jul 29 19:55:36 EDT 2011


On Fri, Jul 29, 2011 at 03:50:44PM -0500, Dan McGee wrote:
> On Fri, Jul 29, 2011 at 3:32 PM, Lukas Fleischer
> <archlinux at cryptocrack.de> wrote:
> > On Thu, Jul 28, 2011 at 01:59:07PM -0500, Dan McGee wrote:
> >> This implements the following scheme:
> >>
> >> * /packages/cower/ --> /packages/co/cower/
> >> * /packages/j/     --> /packages/j/j/
> >> * /packages/zqy/   --> /packages/zq/y/
> >
> > I hope there's a typo in the last example, otherwise I must have
> > misunderstood something :)
> Yes, typo.

You might want to amend this when resubmitting the patch (details
follow) :)

> 
> >>
> >> We take up to the first two characters of each package name as an
> >> intermediate subdirectory, and then the full package name lives
> >> underneath that.
> >>
> >> Why, you ask? Well because earlier today the AUR hit 32,000 entries in
> >> the unsupported/ directory, making new package uploads impossible. While
> >> some might argue we shouldn't have so many damn packages in the repos,
> >> we should be able to handle this case.
> >>
> >> Why two characters instead of one? Our two biggest two-char groups, 'pe'
> >> and 'py', both start with 'p', and have nearly 2000 packages each. Go
> >> Python and Perl.
> >
> > Time to move to ext4, eh? No, seriously: Something tells me that we
> > should neither be filesystem dependent nor depend on the current
> > distribution of package names. Using some better hash algorithm might
> > fix the second problem while reducing predictability for the end user.
> We wouldn't be the first to use a scheme like this, so it felt like
> the right choice. See for example:
> http://pypi.python.org/packages/source/D/Django/Django-1.3.tar.gz (one
> letter only)
> http://search.cpan.org/CPAN/authors/id/S/SR/SRI/Mojolicious-1.68.tar.gz
> (segmented by author, multiple levels)

Yeah, I've seen this before.

> 
> Ext4 came up as an option yesterday, but it's best to prevent one from
> shooting themselves in the foot like this, and this seems like the
> more proper solution. I do share some thoughts that we shouldn't
> depend on a certain distribution of package names, but reducing
> predictability seemed like a big enough downfall that I didn't want to
> go that way, and moving to this scheme greatly increases the time
> before we'd hit any problems. If anyone has predictable but scalable
> solutions I'm more than open to hearing them. Of course, we do provide
> the URL in the JSON request for a reason.

Well, one predictable and scalable solution would be to split after
every character or after every two characters and create nested
directories (as we only allow a subset of all possible file names, that
would still result in fewer than ~32000 subdirectories per directory).
This would result in a very inscrutable directory structure, though.
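
To illustrate the nested idea, here is a minimal sketch in Python. It is
only one possible reading of "split after every N characters"; the names
nested_path and max_fanout, the "/packages" root, and the alphabet size
used in the usage example are assumptions for illustration, not anything
taken from the AUR code.

```python
def nested_path(pkgname, width=2, root="/packages"):
    """Split pkgname into chunks of `width` characters and nest them.

    Hypothetical helper: the root and the chunking rule are assumptions.
    """
    if not pkgname:
        raise ValueError("package name must not be empty")
    # "cower" with width=2 becomes ["co", "we", "r"].
    chunks = [pkgname[i:i + width] for i in range(0, len(pkgname), width)]
    return "/".join([root] + chunks) + "/"


def max_fanout(alphabet_size, width):
    """Upper bound on full-width subdirectory names within one directory."""
    return alphabet_size ** width


if __name__ == "__main__":
    print(nested_path("cower"))           # /packages/co/we/r/
    print(nested_path("cower", width=1))  # /packages/c/o/w/e/r/
    # 40 is a made-up alphabet size, used only to show the arithmetic.
    print(max_fanout(40, 2))              # 1600
```

The fan-out per directory is bounded by the number of allowed characters
raised to the chunk width, which is why restricting the set of valid
package name characters keeps each level small.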

As I mentioned before, I'm fine with using the 2-character prefix for
now. It feels like the best compromise. We can think about more
individualized solutions later.
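
For reference, the 2-character prefix scheme described in the patch
could be sketched roughly like this in Python. This is not the code from
the patch (which touches the PHP files listed in the diffstat); the
function name segmented_path and the default "/packages" root are
assumptions made for the example.

```python
import posixpath


def segmented_path(pkgname, root="/packages"):
    """Return the directory for pkgname under the prefix scheme.

    Up to the first two characters of the name form an intermediate
    subdirectory; the full package name lives underneath it.
    """
    if not pkgname:
        raise ValueError("package name must not be empty")
    prefix = pkgname[:2]  # a one-character name yields a one-character prefix
    return posixpath.join(root, prefix, pkgname) + "/"


if __name__ == "__main__":
    for name in ("cower", "j", "zqy"):
        print(name, "->", segmented_path(name))
```

With the typo in the third example corrected, this gives
/packages/co/cower/, /packages/j/j/ and /packages/zq/zqy/.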

> 
> > Given that we will probably run into the same again soon (according to
> > current statistics, there are about two months left), I will apply this
> > temporary workaround and prepare a release soon, though.
> Yeah, we were able to free ~1000 "spots" or so, so we have breathing
> room, but not loads of time. I did this via the cleanup script, if that
> wasn't obvious; I realize I didn't really say that anywhere.

That was kind of obvious, yeah :) Especially since you submitted the
cleanup script patch as well.

> 
> >> Still needed is a "move the existing data" script, as well as a set of
> >> rewrite rules for those wishing to preserve backward compatible URLs for
> >> any helper programs doing the wrong thing and relying on them.
> >
> > If we provide backward compatible URLs, why not keep them as default? I
> > doubt this will affect performance...
> It wouldn't. I was more concerned with keeping the AUR deployable as
> easily as possible, and not tied to a specific webserver and its
> hairy configuration of rewrite rules. Making them optional prevents us
> from having to muck with things at that level.

Ack.
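
As a rough sketch of what an optional backward-compatibility mapping
would have to do, independent of any particular webserver, here is the
translation written as a Python function. The legacy URL layout
(/packages/<name>/<file>), the regular expression and the name
rewrite_legacy_path are assumptions for illustration; actual rewrite
rules would be written in the webserver's own configuration syntax.

```python
import re

# Assumed legacy layout: /packages/<pkgname>/<anything>
_LEGACY = re.compile(r"^/packages/(?P<name>[^/]+)/(?P<rest>.*)$")


def rewrite_legacy_path(path):
    """Map an old-style path to the prefix-segmented layout.

    Must only be applied to legacy paths: a path that is already in the
    new layout also matches the pattern and would be rewritten again.
    """
    m = _LEGACY.match(path)
    if m is None:
        return path  # not under /packages/, leave untouched
    name = m.group("name")
    return "/packages/{}/{}/{}".format(name[:2], name, m.group("rest"))


if __name__ == "__main__":
    print(rewrite_legacy_path("/packages/cower/cower.tar.gz"))
    print(rewrite_legacy_path("/packages/j/PKGBUILD"))
```

The caveat in the docstring is part of why keeping such rules optional
is attractive: old and new paths share the same /packages/ namespace.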

> 
> >> Signed-off-by: Dan McGee <dan at archlinux.org>
> >> ---
> >>  scripts/cleanup              |   24 ++++++++++++++++--------
> >>  web/html/pkgsubmit.php       |    2 +-
> >>  web/lib/aurjson.class.php    |    2 +-
> >>  web/template/pkg_details.php |    2 +-
> >>  4 files changed, 19 insertions(+), 11 deletions(-)
> >>

