[arch-dev-public] [objectives] Package triage on current + extra (archstats)
I know we weren't supposed to rush into the objectives here before settling on the goals, but I think this is an important one to examine, especially with a change in the repository structure happening in the coming months.
How many of the devs are using archstats, let alone users? After looking at it a bit more today, I realized this could be a great asset to finding areas where devs should be spending their time with regard to package maintenance. It would also help us out greatly when it comes to determining which packages should no longer be maintained by us and dropped back to the TU level and/or unsupported.
However, a few things need work:
1. The current website <http://www.archlinux.org/~simo/archstats/index.php> is in dire need of an overhaul. Using the old website theme is the least of our worries; things like the package listing are at this point rather unusable and only suck 223 MB of memory in Firefox once fully loaded. I have a lot of ideas for this here: enable breakdown by pkgname only (so it actually looks like people are running kernel26), limit the number of results on the page unless someone actually selects to see them all, etc.
2. The archstats database. It contains several one-time system updates, and several systems that haven't updated since 2005 or earlier. This is clearly junk data, and to make archstats useful we should probably just start fresh, and find a way to cut down on spurious commits, which leads into...
3. The archstats program itself. In the last week, I've had no problems with it, but have had problems before (some of the spurious commits above were definitely my fault). Configuration should probably be editable in a conf file (the current /etc file has a big fat warning saying do not edit by hand; this does not seem like the Arch Way). Setting it up as a cron job is straightforward, but if we want people to use it we should probably think of a way to make it even easier.
Comments on any of this? I know it's yet another project idea to be thrown out there, but this one could prove very helpful and in the long run vastly reduce dev maintenance of packages when we realize very few users are actually using a package and it should be maintained elsewhere.
-Dan
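A rough sketch of items 1 and 2 above, in Python: count systems per pkgname only (ignoring pkgver), skip systems that have not reported since a cutoff date, and cap the number of rows returned unless the caller asks for everything. The record layout, the sample data, the cutoff and the name `package_counts` are all assumptions made for illustration; none of them come from the actual archstats code or database.

```python
from collections import Counter
from datetime import date

# Hypothetical submission records: (system_id, last_update, installed packages).
# Each package is a (pkgname, pkgver) pair; every value here is made up.
SUBMISSIONS = [
    ("sys-a", date(2007, 5, 1), [("kernel26", "2.6.21-1"), ("pacman", "3.0.4-1")]),
    ("sys-b", date(2007, 4, 20), [("kernel26", "2.6.20.7-1"), ("vim", "7.0.219-1")]),
    ("sys-c", date(2005, 3, 2), [("kernel26", "2.6.11-1"), ("xfree86", "4.3.0-1")]),
]


def package_counts(submissions, cutoff, limit=None):
    """Count systems per pkgname, ignoring pkgver and stale systems."""
    counts = Counter()
    for _system, last_update, packages in submissions:
        if last_update < cutoff:
            continue  # skip systems that have not reported since the cutoff
        # A set guards against counting one system twice for the same pkgname.
        counts.update({name for name, _version in packages})
    # Sort by count (descending), then by name, so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked if limit is None else ranked[:limit]


if __name__ == "__main__":
    for name, count in package_counts(SUBMISSIONS, cutoff=date(2006, 1, 1), limit=2):
        print(f"{name}: {count}")
```

With the sample data, the two kernel26 versions collapse into a single row with a count of 2, and the system last seen in 2005 contributes nothing.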
On Sun, May 06, 2007 at 08:28:53PM -0400, Dan McGee wrote:
I know we weren't supposed to rush into the objectives here before settling on the goals, but I think this is an important one to examine, especially with a change in the repository structure happening in the coming months.
How many of the devs are using archstats, let alone users? After looking at it a bit more today, I realized this could be a great asset to finding areas where devs should be spending their time with regard to package maintenance. It would also help us out greatly when it comes to determining which packages should no longer be maintained by us and dropped back to the TU level and/or unsupported.
However, a few things need work:
1. The current website <http://www.archlinux.org/~simo/archstats/index.php> is in dire need of an overhaul. Using the old website theme is the least of our worries; things like the package listing are at this point rather unusable and only suck 223 MB of memory in Firefox once fully loaded. I have a lot of ideas for this here: enable breakdown by pkgname only (so it actually looks like people are running kernel26), limit the number of results on the page unless someone actually selects to see them all, etc.
2. The archstats database. It contains several one-time system updates, and several systems that haven't updated since 2005 or earlier. This is clearly junk data, and to make archstats useful we should probably just start fresh, and find a way to cut down on spurious commits, which leads into...
3. The archstats program itself. In the last week, I've had no problems with it, but have had problems before (some of the spurious commits above were definitely my fault). Configuration should probably be editable in a conf file (the current /etc file has a big fat warning saying do not edit by hand; this does not seem like the Arch Way). Setting it up as a cron job is straightforward, but if we want people to use it we should probably think of a way to make it even easier.
Comments on any of this? I know it's yet another project idea to be thrown out there, but this one could prove very helpful and in the long run vastly reduce dev maintenance of packages when we realize very few users are actually using a package and it should be maintained elsewhere.
Actually I have an entirely rewritten archstats lying on my hard drive; all it needs is someone to slap a pretty web interface on it. It's probably not very well done (it was me playing with Django a bit), but nonetheless it didn't take very long to put together.
As for ideas, what I've been wanting to do is have a way for pacman to have some form of "hooks": something where a command gets run after an -Syu, an -S, or an -R. In this case, that command would be whatever is required for archstats to update the package list. This would make it really easy for people to just "set and forget" archstats, and we could get very good and up-to-date stats that way.
Also, I sort of inherited the archstats project from eric when he left, and I haven't really touched it at all, besides culling some old data once in a while (haven't done that in a few months though).
It's another one of those projects I said I'd work on but haven't gotten around to... I won't bother making promises I might not keep, but school does end soon and I've got nothing but free time in a few weeks... hopefully I can get my butt in gear and at least get the ball rolling or something.
-S
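The "hooks" idea can be approximated without any change to pacman by a small wrapper, sketched here in Python. This is only an illustration under stated assumptions: `run_with_hook` and `is_trigger` are made-up names, `archstats --update` is a placeholder rather than a real archstats command line, and the check is deliberately coarse (it treats every -S... or -R... invocation as a change to the package list and ignores long options).

```python
import subprocess

# Placeholder for "whatever is required for archstats to update the package
# list"; this is NOT a real archstats command line.
ARCHSTATS_UPDATE = ["archstats", "--update"]


def is_trigger(pacman_args):
    """Return True when the first argument is an -S... or -R... operation.

    Coarse on purpose: long options are ignored, and query-style sync calls
    are not told apart from real installs or upgrades.
    """
    return bool(pacman_args) and pacman_args[0][:2] in ("-S", "-R")


def run_with_hook(pacman_args, runner=subprocess.call):
    """Run pacman, then the update command if pacman exited successfully."""
    status = runner(["pacman", *pacman_args])
    if status == 0 and is_trigger(pacman_args):
        runner(ARCHSTATS_UPDATE)
    return status
```

Calling `run_with_hook(["-Syu"])` would run the upgrade and then the update command; the `runner` parameter exists only so the logic can be exercised without invoking pacman.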
On 5/6/07, Simo Leone <simo@archlinux.org> wrote:
Actually I have an entirely rewritten archstats lying on my hard drive; all it needs is someone to slap a pretty web interface on it. It's probably not very well done (it was me playing with Django a bit), but nonetheless it didn't take very long to put together.
As for ideas, what I've been wanting to do is have a way for pacman to have some form of "hooks": something where a command gets run after an -Syu, an -S, or an -R. In this case, that command would be whatever is required for archstats to update the package list. This would make it really easy for people to just "set and forget" archstats, and we could get very good and up-to-date stats that way.
Also, I sort of inherited the archstats project from eric when he left, and I haven't really touched it at all, besides culling some old data once in a while (haven't done that in a few months though).
It's another one of those projects I said I'd work on but haven't gotten around to... I won't bother making promises I might not keep but school does end soon, I've got nothing but free time in a few weeks... hopefully I can get my butt in gear and at least get the ball rolling or something.
Simo already knows this, but here is what I started last night while procrastinating on my exam studying:
<http://code.toofishes.net/gitweb.cgi?p=archstats.git;a=summary>
I've managed to remove about 500 lines of code by moving repetition to functions (still more to be done, but that was 1/6 of the code). I also completely bypassed the MD5sum checking stuff, showing how that is worthless. Simo and I were trying to think of a better way to do client verification (Jürgen, any ideas?), and we came up with nothing.
Obviously we could get poisoned data using archstats, but at the same time, some data seems better than no data, especially if we can patrol it.
-Dan
On Mon, May 07, 2007 at 12:13:16PM -0400, Dan McGee wrote:
I've managed to remove about 500 lines of code by moving repetition to functions (still more to be done, but that was 1/6 of the code). I also completely bypassed the MD5sum checking stuff, showing how that is worthless. Simo and I were trying to think of a better way to do client verification (Jürgen, any ideas?), and we came up with nothing.
There is no solution, if users are anonymous. A simple workaround/hack:
Prevent connects from the same IP (for a limited time period).
This could limit the possibility to flood the database with multiple machine entries from one user.
Jürgen
On 5/7/07, Jürgen Hötzel <juergen@hoetzel.info> wrote:
There is no solution, if users are anonymous. A simple workaround/hack:
Prevent connects from the same IP (for a limited time period).
This could limit the possibility to flood the database with multiple machine entries from one user.
I thought about this solution as well, but I realized it does carry with it a rather large negative. If a user has 4 Arch boxes behind a router with NAT, and they all run archstats as a cronjob at the same time, we would be excluding all but 1 of his boxes from updates.
How essential is user anonymity on submission? Would users feel comfortable registering (which is a hurdle I think we should try to avoid) if their anonymous state was still preserved in any data presented to the user?
-Dan
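The per-IP workaround and its NAT drawback can be made concrete with a small Python sketch. The class name `IpRateLimiter`, the in-memory dictionary and the one-hour window are assumptions for illustration only; nothing here reflects how archstats actually stores or checks submissions.

```python
from datetime import datetime, timedelta


class IpRateLimiter:
    """Accept at most one submission per IP address per time window."""

    def __init__(self, window=timedelta(hours=1)):
        self.window = window
        self._last_accepted = {}  # IP address -> time of last accepted submission

    def allow(self, ip, now):
        last = self._last_accepted.get(ip)
        if last is not None and now - last < self.window:
            return False  # too soon after the previous accepted submission
        self._last_accepted[ip] = now
        return True


# Four machines behind one NAT router share a public address
# (203.0.113.7 is a documentation-only address), and their cron
# jobs fire within the same few seconds.
limiter = IpRateLimiter()
midnight = datetime(2007, 5, 7, 0, 0)
results = [
    limiter.allow("203.0.113.7", midnight + timedelta(seconds=s))
    for s in range(4)
]
print(results)  # [True, False, False, False]
```

Only the first of the four boxes gets through, which is exactly the exclusion described above; staggering the cron times by more than the window would avoid it, at the cost of relying on client configuration.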
2007/5/7, Dan McGee <dpmcgee@gmail.com>:
I thought about this solution as well, but I realized it does carry with it a rather large negative. If a user has 4 Arch boxes behind a router with NAT, and they all run archstats as a cronjob at the same time, we would be excluding all but 1 of his boxes from updates.
How essential is user anonymity on submission? Would users feel comfortable registering (which is a hurdle I think we should try to avoid) if their anonymous state was still preserved in any data presented to the user?
Also, users connected to large ISPs have dynamic IPs (at least this is the common case in Eastern Europe).
--
Roman Kyrylych (Роман Кирилич)
On 5/7/07, Roman Kyrylych <roman.kyrylych@gmail.com> wrote:
Also, users connected to large ISPs have dynamic IPs (at least this is the common case in Eastern Europe).
That isn't the problem here, Roman; that will work just fine. We aren't planning on doing a one-to-one pairing of users with IP addresses. He only proposed that we limit the connections from one IP address to one update an hour, for example, to prevent mass floods of the database.
-Dan
On 5/7/07, Dan McGee <dpmcgee@gmail.com> wrote:
I thought about this solution as well, but I realized it does carry with it a rather large negative. If a user has 4 Arch boxes behind a router with NAT, and they all run archstats as a cronjob at the same time, we would be excluding all but 1 of his boxes from updates.
How essential is user anonymity on submission? Would users feel comfortable registering (which is a hurdle I think we should try to avoid) if their anonymous state was still preserved in any data presented to the user?
One thing cactus brought up recently is user LDAP support. Supposedly MediaWiki and PunBB support LDAP login. If we were to use LDAP for user accounts, we could spread that to all aspects, such as the AUR and archstats.
*Then* we could implement the "only one submission from this user every X minutes" setup.
On 5/7/07, Aaron Griffin <aaronmgriffin@gmail.com> wrote:
One thing cactus brought up recently is user LDAP support. Supposedly MediaWiki and PunBB support LDAP login. If we were to use LDAP for user accounts, we could spread that to all aspects, such as the AUR and archstats.
*Then* we could implement the "only one submission from this user every X minutes" setup.
I like this idea. I don't think most people would object to having to "register" if it was a "register once for everything" type of thing. In addition, this would keep out junk data (people without accounts) and misconfigured clients (sending every 5 minutes). However, some (many?) users have multiple computers, but that should be a minor implementation detail to work with later.
Looks like Flyspray doesn't and won't have it though:
http://bugs.flyspray.org/task/233
http://bugs.flyspray.org/task/758
-Dan
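One way the "one submission from this user every X minutes" rule could coexist with users who own several computers is to key the limit on the account plus a per-machine identifier. The Python sketch below is hypothetical throughout: `AccountRateLimiter`, the machine identifier, the 30-minute interval and the in-memory set of known users (standing in for an LDAP lookup) are all assumptions, not part of any existing archstats or Arch account system.

```python
from datetime import datetime, timedelta


class AccountRateLimiter:
    """One accepted submission per (user, machine) pair per interval."""

    def __init__(self, known_users, interval=timedelta(minutes=30)):
        self.known_users = set(known_users)  # stand-in for an account lookup
        self.interval = interval             # the "X minutes"; 30 is arbitrary
        self._last_accepted = {}             # (user, machine) -> last accepted time

    def allow(self, user, machine, now):
        if user not in self.known_users:
            return False  # junk data: submitter has no account
        key = (user, machine)
        last = self._last_accepted.get(key)
        if last is not None and now - last < self.interval:
            return False  # e.g. a misconfigured client sending every 5 minutes
        self._last_accepted[key] = now
        return True


limiter = AccountRateLimiter(known_users={"alice"})
noon = datetime(2007, 5, 7, 12, 0)
print(limiter.allow("alice", "desktop", noon))  # True
print(limiter.allow("alice", "laptop", noon))   # True: second machine, own slot
print(limiter.allow("alice", "desktop", noon + timedelta(minutes=5)))  # False
print(limiter.allow("mallory", "box", noon))    # False: no such account
```

This sketch does nothing to stop one account from inventing machine identifiers, so a cap on machines per account would probably be needed in practice.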
On Sun, May 06, 2007 at 08:28:53PM -0400, Dan McGee wrote:
How many of the devs are using archstats, let alone users? After looking at it a bit more today, I realized this could be a great asset to finding areas where devs should be spending their time with regard to package maintenance. It would also help us out greatly when it comes to determining which packages should no longer be maintained by us and dropped back to the TU level and/or unsupported.
I would not rely on archstats. Data can be easily spoofed by malicious clients.
Jürgen
participants (5)
- Aaron Griffin
- Dan McGee
- Jürgen Hötzel
- Roman Kyrylych
- Simo Leone