On Sat, Apr 30, 2016 at 12:08:07AM +0200, Lukas Fleischer wrote:
On Fri, 29 Apr 2016 at 13:56:58, Dave Reisner wrote:
What are your reasons for not wanting to extend the RPC interface? Splitting the API in this way is cumbersome for clients. We now are supposed to download and maintain data which is essentially an index (which ought to be internal to the server to allow strong consistency). You're also asking clients of a read-only interface to become stateful, which isn't something that really interests me.
Isn't that exactly the way pacman works, though? It downloads a copy of the database locally and uses that database to answer requests and obtain package information. My vision is that, optimally, the official repositories and the AUR build upon the same basic concept. Apart from binary vs. source packages, the only real difference between the official repositories and the AUR is the number of packages, but if we figure out a good way to solve point (2) from my initial email, that should be a non-issue.
Hrmm, I don't know that this is an equal comparison. Here's my perception of the current world: pacman relies on distribution of the *entire* DB to mirrors around the world. Due to the tiered mirror system, you can basically only rely on eventual consistency of tier N>0 with tier 0, but the DBs at any given point in time should be consistent with themselves (i.e. assuming they're well-behaved, they won't advertise packages which they don't have). In addition to the sync tarballs, pacman relies on a local database which it mutates as packages are installed, upgraded, and removed. pacman has reduced functionality when it has no reachable mirror -- it's still capable of removing packages, modifying the local DB (to adjust install reasons), and installing packages which are present in a file cache.

In contrast, the AUR currently only offers an API to support ad-hoc queries. There are no mirrors, and the RPC interface offers strong consistency with the contents of the AUR. I think we can agree that in their current form, the packages.gz and pkgbases.gz files aren't very useful, as they tend to lag too far behind reality. AUR clients currently have a hard dependency on the network. If they cannot reach the AUR, they cannot do anything useful.

Your proposal to make the pkgname/pkgbase tarballs more closely consistent doesn't change the network dependency. All it seems to do is offload the ability to perform more precise searching to the client, *if* they choose to implement it. I'm suggesting that the server should do this, such that we have a single implementation which *everyone* can take advantage of -- not just clients of the RPC interface, but the web UI as well. I might go so far as to say that we should try to find ways to *remove* packages.gz and pkgbases.gz, and devise better solutions to the problems people want to solve with these files.
Apart from that, there are two general directions we can go:
* Do everything on the server. Keep extending the server for every feature that is needed by some client. What happens if a user only wants to know the number of packages matched by a given expression? Do we really want to force her to fetch the whole list of matched packages, just to obtain its size, or do we add another request type? And even if regular expressions were the last missing thing, adding them demands a bit more thought than one might expect (what kinds of expressions do we support? do we need to care about ReDoS, or is that handled by the engines themselves? etc.)
Agreed. Regular expressions aren't necessarily what we want to end up with. As an alternative, prefix and suffix matching would be substantially cheaper, less prone to abuse/DoS, and would probably fulfill the needs of most people. If you wanted to offer the ability to return just the size of the result set for some advanced search method, you could add another parameter to the current search interface which would elide the 'results' list in the response JSON. You already have a 'resultcount' field with the size.
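To make the idea concrete, here is a minimal sketch of what such a request might look like from a client's side. The "count" parameter is purely hypothetical -- it does not exist in the current RPC interface, and the name is invented for illustration:

```python
from urllib.parse import urlencode

def build_search_url(keyword, count_only=False):
    """Build an AUR RPC search URL.

    count_only adds a hypothetical 'count' flag that would tell the
    server to omit the 'results' list and return only 'resultcount'.
    """
    params = {"v": 5, "type": "search", "arg": keyword}
    if count_only:
        params["count"] = 1
    return "https://aur.archlinux.org/rpc/?" + urlencode(params)

url = build_search_url("pacman", count_only=True)
```

The server would then answer with the usual envelope minus the 'results' array, so clients that only need the size never pay for the full payload.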
* Directly publish all the information required to answer all possible requests. Let the clients do whatever they want. Currently, we only provide package names but in the future, this could be extended to a more complete database like the one pacman uses.
This has the same problems as the current gz files -- you can only offer eventual consistency. It also only scales well if you can distribute the load in the same way that pacman does with a tiered mirror system. This comes with a non-zero maintenance cost.
I am not saying that the second option is the holy grail. For a simple web application that retrieves information on a package, or for a single basic package search, downloading everything might be overkill. That is why I suggest keeping the very basic RPC interface we have right now and, additionally, providing package databases for fancier applications.
I am not set on this idea yet. It just seems like the most natural and Arch-like way of handling this kind of thing. I am open to discussion!
Or, change the storage for the name list such that updates can be fast. Turns out, you already have such a thing, you'd just need an index on the Packages and PackageBases tables.
Those indices are there already. Dropping the package list cache completely might be an option (I'd have to investigate the performance impact).
Serving the package list from the index would essentially be a full table scan -- I don't think it's going to go well.
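That said, the full-scan concern applies to dumping the *entire* list; a prefix lookup against an indexed name column is just a bounded range scan and stays cheap. A minimal sketch using SQLite as a stand-in (the AUR's real schema lives in MySQL and is richer -- the table and column names here are simplified):

```python
import sqlite3

# Toy stand-in for the Packages table with an index on Name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Packages (ID INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("CREATE INDEX PackagesNameIdx ON Packages (Name)")
conn.executemany("INSERT INTO Packages (Name) VALUES (?)",
                 [("pacman-git",), ("pacman-contrib",), ("yaourt",)])

# A prefix match is a bounded range scan on the index, unlike an
# unanchored %term% search, which has to examine every row.
prefix = "pacman"
rows = conn.execute(
    "SELECT Name FROM Packages WHERE Name >= ? AND Name < ? ORDER BY Name",
    (prefix, prefix + "\uffff")).fetchall()
```

So a server-side prefix (and, with a reversed-name column, suffix) search could be served from the existing indices without materializing the whole package list.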
I'm not understanding why any of this is considered a good direction for the API. What are the reasons for wanting the whole list of package names on the client side? Are there use cases other than search?
Search could be extended in many ways, especially now that we have useful metadata. One could build full dependency trees of the AUR, or add proper support for package groups, just to name two examples.
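As a rough sketch of the dependency-tree idea: given per-package metadata shaped like the RPC "info" results, a client (or the server) could expand dependencies recursively. The package names and depends lists below are invented sample data:

```python
# Sample records mimicking the shape of AUR "info" results;
# the names and dependencies are made up for illustration.
aur_info = {
    "foo-git": {"Depends": ["libbar", "baz-utils"]},
    "baz-utils": {"Depends": ["libbar"]},
    "libbar": {"Depends": []},
}

def dep_tree(name, info, seen=None):
    """Recursively expand dependencies as a nested dict.

    Packages already visited (or unknown) are not re-expanded,
    which keeps the recursion safe against dependency cycles.
    """
    seen = set() if seen is None else seen
    if name in seen or name not in info:
        return {}
    seen.add(name)
    return {dep: dep_tree(dep, info, seen)
            for dep in info[name]["Depends"]}

tree = dep_tree("foo-git", aur_info)
```

Whether this lives in each client or once on the server is exactly the trade-off under discussion above.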
I agree. These are useful ideas.
Regards, Lukas