[aur-general] Vote - Moving [community] to use same system as main repos

Mon Jan 26 20:00:40 EST 2009

On Mon, Jan 26, 2009 at 4:17 PM, Grigorios Bouzakis <grbzks at gmail.com> wrote:
> On Mon, Jan 26, 2009 at 02:39:38PM -0600, kludge wrote:
>> if there is no voting on packages in [community], then what mechanism
>> exists for users to suggest/cheer on a package's promotion to [extra]?
>>
>> -kludge
>
> One user's answer:
>
> Maybe the same mechanism that would move packages out of extra to
> unsupported or community?
> It's called pkgstats. pacman -S pkgstats and then exec pkgstats as root
> IIRC.
>
> Greg

Holy smokes, the current direction things are going in is scaring me.
I don't understand why so much reliance has been put into pkgstats
when there are a lot of fundamental problems with it in its current
form. These are the EXACT same concerns that Bob Finch and I brought
up with the recent vote on community package guidelines, and it seems
everyone has been ignoring those issues and is ready to put all sorts
of trust into pkgstats without examining its pitfalls.

For starters, pkgstats has been around only since the beginning of
November of last year (a little less than three months now). Although
its implementation is simpler by design, we did have ArchStats
before that, and people had NOWHERE near as much faith in that tool's
statistics as they apparently do in pkgstats. Yes, ArchStats was more
complicated (and unmaintained), but it also had other features and
protections that pkgstats does not. I agree that simplicity is nice
from a development standpoint, but I'd argue that it also has the
potential to lower the tool's accuracy.

As far as implementation issues go, right now pkgstats does not
account for multiple machines sharing the same IP address (extremely
common in home environments running NAT). It also cannot track changes
over time or expire old data, which I'd think would be an absolute
necessity if you want to be able to rely on it for getting an
accurate snapshot of current installation numbers or to see trends
(which seems to be what people want the tool to be used for).
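
To make those two gaps concrete, here is a minimal sketch of how a
stats server could count per machine instead of per IP and expire old
data. It is not how pkgstats works: the machine_id field, the record
layout, the function name, and the 90-day window are all assumptions
made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical submission record: (machine_id, submitted_at, packages).
# pkgstats does not send a machine_id; it is assumed here so that
# several machines behind one NAT address can be told apart.

def current_install_share(submissions, now, max_age_days=90):
    """Return {package: share of machines}, using only each machine's
    newest submission and ignoring anything older than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    latest = {}  # machine_id -> (submitted_at, set of packages)
    for machine_id, submitted_at, packages in submissions:
        if submitted_at < cutoff:
            continue  # expire stale data
        prev = latest.get(machine_id)
        if prev is None or submitted_at > prev[0]:
            latest[machine_id] = (submitted_at, set(packages))

    counts = {}
    for _, packages in latest.values():
        for pkg in packages:
            counts[pkg] = counts.get(pkg, 0) + 1

    total = len(latest)
    return {pkg: n / total for pkg, n in counts.items()} if total else {}
```

With something like this, a package that was removed before a
machine's latest submission stops counting, and a machine that has not
reported within the window drops out of the totals entirely.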

Which brings me to the next set of problems. How do you know the data
is an accurate reflection of what the community wants or uses? Right
now, people are using the terms "popularity", "usage", and "installed"
as the same thing and assuming that is what pkgstats shows them. The
script only reflects what is installed, not what people are using, not
what people are interested in...just installed! What about people who
download things to try them out, but do not remove them? What about
packages that don't have old dependencies removed when they're
updated? There are a bunch of scenarios where long-running systems
might have unnecessary or stale packages installed that the user is
unaware of (see liblbxutil at 67.74%, csup at 42.53%, and xorg-xsm at
34.74% as reported by pkgstats).

Also, pkgstats may be the best thing we currently have, but how do we
know the stats it generates are an accurate representation of Arch
users? Do we know how many Arch users there are in total? We have just
over 20,000 registered forum members (the US forum only), just under
13,000 registered AUR users, and just about 3000 unique IPs that have
contributed to pkgstats. The thing is, who knows how many Arch users
are out there that haven't registered for any of our sites? Or what
about users who have registered for our sites but no longer use Arch?
Or users with multiple machines and the issue of whether or not those
machines should count differently for pkgstats? Who knows how many of
those unique IP addresses are unique machines and not just updates
from the same machine where an ISP has handed out a different IP
address? If there are users out there that are unaware of pkgstats, or
aren't that involved in the community (but still use Arch), then are
the current reports skewed one way or another? Meaning, do the
packages used by developers and people active in the community reflect
the packages used by everyone?
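
For a rough sense of scale, the figures above can be put side by side.
This is only a sketch: the constants are the approximate numbers
quoted in this mail, neither denominator is the real number of Arch
users, and an IP is not necessarily one machine or one person.

```python
# Approximate figures quoted above; none is an exact count.
FORUM_MEMBERS = 20_000
AUR_USERS = 13_000
PKGSTATS_IPS = 3_000

def coverage(sample, population):
    """Fraction of a population that the sample could represent."""
    return sample / population

print(f"vs forum members: {coverage(PKGSTATS_IPS, FORUM_MEMBERS):.1%}")
print(f"vs AUR users:     {coverage(PKGSTATS_IPS, AUR_USERS):.1%}")
```

Even taken at face value, that is a self-selected minority of either
group, which is why the question of skew matters.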

Also, when pkgstats was introduced, it was said that "In an ideal
world one would run the script only once per installation or if really
lot of things have changed (not the version of packages).", but if you
want these numbers to be up-to-date, then it SHOULD be run on a
regular basis and the old data should be discarded or weighted
differently. Right now, there have been 3,757 pkgstats submissions,
meaning about 700 people have run it multiple times. If we want people
to keep up that habit and maintain current info, then perhaps it
should come with a script that could be enabled to run it as a cron
job...even if it does not get enabled by default.
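
The repeat figure comes from simple subtraction, and it is really a
bound rather than a head count. A small sketch, assuming one IP equals
one submitter (which, as noted earlier, is shaky) and using a helper
name made up for this example:

```python
SUBMISSIONS = 3_757
UNIQUE_IPS = 3_000  # "just about 3000", so treat the result as rough

def repeat_submitter_bounds(submissions, unique_ips):
    """Return (min, max) number of IPs that submitted more than once."""
    extra = submissions - unique_ips  # submissions beyond one per IP
    if extra <= 0:
        return (0, 0)
    # Lowest case: a single IP sent every extra submission.
    # Highest case: each extra submission came from a different IP.
    return (1, min(extra, unique_ips))

print(repeat_submitter_bounds(SUBMISSIONS, UNIQUE_IPS))
```

So the number of repeat submitters is at most the number of extra
submissions, and could be a good deal lower.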

Not only that, but pkgstats reports that no package has a 100%
installation percentage...even pkgstats itself only has 98.88%, which
means people have stuff installed on their systems that has not been
reported. In this particular case, it's probably people that have
downloaded the script directly rather than installed it via pacman,
but it's still an issue to consider with other packages as well.

Because pkgstats is not an "official" tool that is distributed with
the core and turned on by default (which I don't think it should be),
that alone means it has some amount of bias built into it. Not knowing
the answers to a lot of these questions means that it's incredibly
hard to tell statistically if the numbers reported by pkgstats are
accurate or should be relied upon, and yet that's what people are
doing.

My whole point in bringing all this up is that it seems like people
are treating pkgstats as a panacea, when I don't think it should be
the sole basis of all of these recent decisions. I absolutely support
the idea of pkgstats, but think that it's still in its infancy and
should not be so blindly trusted. People have been suggesting
arbitrary limits and restrictions based on its numbers, but the
numbers themselves could be pretty far off base...and I think that's
the bigger problem.

--
Aaron "ElasticDog" Schaefer