[arch-dev-public] pkgstats: second try
Hi all,
two years ago (http://www.archlinux.org/news/419/) I created a stupid script to get some stats about package usage from our users.
I made some improvements. The client now submits the mirror used by pacman. On the server side, statistics about the country (looked up via GeoIP) are stored and, more importantly, all data are stored with a timestamp. This way we should be able to see usage trends etc. It would also be possible to have users submit data regularly without messing up all the stats.
For now, please have a look at the client itself. Check the output of "-s" and make sure you are using 2.0-2, especially if you use a password in your mirrorlist. ;-)
The results can be found at https://www.archlinux.de/?page=PackageStatistics
It's quite rough and needs a lot of optimization. It also only shows the overall stats.
Note that my goal is to collect data that are actually useful to us and not just nice to see. If you have any suggestions or ideas, let me know. I think once this is working properly we should make another announcement to get helpful stats for a package cleanup or some mirroring improvements.
Greetings,
Pierre
--
Pierre Schmitz, https://users.archlinux.de/~pierre
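The remark about passwords in the mirrorlist suggests that credentials embedded in a mirror URL should never reach the stats server. The following is only a sketch of how a client could strip them before submitting; it is not taken from pkgstats, and the function name `strip_credentials` and the example hosts are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_credentials(mirror_url: str) -> str:
    """Return mirror_url without any user:password@ part (hypothetical helper)."""
    parts = urlsplit(mirror_url)
    host = parts.hostname or ""
    if ":" in host:
        # IPv6 literals must be wrapped in brackets again.
        host = f"[{host}]"
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    # Rebuild the URL from everything except the userinfo component.
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    # Example mirror entry with credentials; host and login are placeholders.
    print(strip_credentials("https://user:secret@mirror.example.org/archlinux/$repo/os/$arch"))
```

Rebuilding the URL from its parsed parts, rather than cutting text with a regular expression, keeps ports and IPv6 hosts intact.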
On 09/10/2010 06:52 PM, Pierre Schmitz wrote:
Hi all,
two years ago (http://www.archlinux.org/news/419/) I created a stupid script to get some stats about package usage from our users.
I did some improvements. The client now submits the mirror used by pacman. On the server side statistics about the country (looked up by geoip) are stored and more important all data are stored with a timestamp. This way we should be able to see usage trends etc.. It would also be possible to have users submit data regularly without messing up all stats.
For now please have a look at the client itself. Check the output of "-s" and make sure you are using 2.0-2; especially if you use a password in your mirrorlist. ;-) The results can be found at https://www.archlinux.de/?page=PackageStatistics It's quite rough and needs a lot of optimization. It also just shows the overall stats.
Note, that my goal is to collect data that are actually useful to us and not just nice to see. If you have any suggestions or ideas let me know. I think once this is working properly we should make another announcement to get helpful stats for a package cleanup or some mirroring improvements.
Greetings,
Pierre
maybe we want to announce this again on forums and news to collect more data?
-- Ionuț
On 10 September 2010 20:58, Ionuț Bîru <ibiru@archlinux.org> wrote:
maybe we want to announce this again on forums and news to collect more data?
+1
-- andreascarpino.it Arch Linux Developer
On Fri, Sep 10, 2010 at 10:52 AM, Pierre Schmitz <pierre@archlinux.de> wrote:
Hi all,
two years ago (http://www.archlinux.org/news/419/) I created a stupid script to get some stats about package usage from our users.
I did some improvements. The client now submits the mirror used by pacman. On the server side statistics about the country (looked up by geoip) are stored and more important all data are stored with a timestamp. This way we should be able to see usage trends etc.. It would also be possible to have users submit data regularly without messing up all stats.
For now please have a look at the client itself. Check the output of "-s" and make sure you are using 2.0-2; especially if you use a password in your mirrorlist. ;-) The results can be found at https://www.archlinux.de/?page=PackageStatistics It's quite rough and needs a lot of optimization. It also just shows the overall stats.
Note, that my goal is to collect data that are actually useful to us and not just nice to see. If you have any suggestions or ideas let me know. I think once this is working properly we should make another announcement to get helpful stats for a package cleanup or some mirroring improvements.
I dream of a day when this stuff can all be in the main Arch website for everyone to see. Any interest in learning Django, Pierre? :)
If anyone out there (yes, you users) knows Django and is looking to contribute to Arch or just hone your Django skills, I would be glad to chat with you about implementing a lot of these types of things in the main site. I've done quite a bit of work lately but I am only one person. Architecture differences are now implemented in the main site, and I am working to get a mirror status report in there as well. If we started thinking about this as well, that would really add some value to the site.
I like the updates you've made. I wonder if it is worth trying to do something a little more sophisticated than just IP address for determining if someone is a repeat customer. The two cases are "my IP is dynamic and changes if I sneeze (thank you crappy DSL connection)" and "we have 200 machines behind one IP address".
-Dan
On 09/10/2010 11:00 PM, Dan McGee wrote:
I like the updates you've made. I wonder if it is worth trying to do something a little more sophisticated than just IP address for determining if someone is a repeat customer. The two cases are "my IP is dynamic and changes if I sneeze (thank you crappy DSL connection)" and "we have 200 machines behind one IP address".
I noticed this myself when I tried to submit the data from another machine in my network.
As an idea, we could use the UUID of the root partition.
-- Ionuț
On Fri, Sep 10, 2010 at 16:15, Ionuț Bîru <ibiru@archlinux.org> wrote:
i noticed this myself when i tried to submit the data from other machine in my network.
like an idea we can use the UUID from the root partition
Maybe... Would it make more sense to take a hash of the eth0 MAC address? Not sure if that is sensible... I guess the UUID doesn't change that often.
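To make the "hash of the eth0 MAC address" idea concrete, here is a minimal sketch. SHA-256 and the normalisation step are assumptions chosen for illustration (the thread names no hash function), and `hash_mac` is a hypothetical helper. On Linux the address could be read from /sys/class/net/eth0/address; the sketch takes it as a plain string instead.

```python
import hashlib


def hash_mac(mac: str) -> str:
    """Return a hex digest identifying a machine by its MAC address (sketch)."""
    # Normalise case and separators so the same card always hashes the same.
    normalized = mac.strip().lower().replace("-", ":")
    # SHA-256 is an assumption; any stable hash would serve the same purpose.
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


if __name__ == "__main__":
    # Placeholder address, not a real device.
    print(hash_mac("00:1A:2B:3C:4D:5E"))
```

Note that the MAC address space is small enough to brute-force, so an unsalted hash like this hides the address only superficially.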
On Fri, 10 Sep 2010 16:16:46 -0400, Daenyth Blank <daenyth+arch@gmail.com> wrote:
On Fri, Sep 10, 2010 at 16:15, Ionuț Bîru <ibiru@archlinux.org> wrote:
i noticed this myself when i tried to submit the data from other machine in my network.
like an idea we can use the UUID from the root partition
Maybe.. Would it make more sense to take a hash of the eth0 mac address? Not sure if that is sensible... I guess the UUID doesn't change that often.
Well, we have discussed all this before. If I don't limit submissions by IP, it will be too easy for a single person to flood us with false data, making the whole stats pointless. The IP is the only value you cannot easily spoof over the internet.
Whatever we would implement on the client side (pkgstats) doesn't matter, as you can still post your data directly or just modify the script. (And yes, client SSL certs are overkill and people won't use pkgstats.)
One thing I could do though is to allow more than one submission per IP and day. What would be a reasonable value? Like 10 submissions per IP within 24h?
Greetings,
Pierre
--
Pierre Schmitz, https://users.archlinux.de/~pierre
On Fri, Sep 10, 2010 at 3:27 PM, Pierre Schmitz <pierre@archlinux.de> wrote:
On Fri, 10 Sep 2010 16:16:46 -0400, Daenyth Blank <daenyth+arch@gmail.com> wrote:
On Fri, Sep 10, 2010 at 16:15, Ionuț Bîru <ibiru@archlinux.org> wrote:
i noticed this myself when i tried to submit the data from other machine in my network.
like an idea we can use the UUID from the root partition
Maybe.. Would it make more sense to take a hash of the eth0 mac address? Not sure if that is sensible... I guess the UUID doesn't change that often.
Well, we have discussed all this before. If I don't limit the submission by ip it will be too easy for a single person to flood us with false data making the whole stats pointless. The ip is the only value you cannot easily spoof over internet.
Sure- I'm not saying don't validate IP addresses at all, but the limit should probably be higher than 1 submission per IP in the given time frame.
Whatever we would implement on the client side (pkgstats) doesn't matter as you still can post your data directly or just modify the script. (and yes, client ssl certs are overkill and people wont use pkgstats)
One thing I could do though is to allow more than one submission per ip and day. what would be a reasonable value? Like 10 submission per ip within 24h?
What about something like this:
1. Submit something "unique" but relatively harmless- first network device MAC address seems reasonable. Root UUID would probably be a bit more work.
2. This suggestion forms an (IP, MAC) combo. If we've seen it before, let it through- what does it matter? We should just update the statistics list for this guy.
3. Same IP, new MAC, and MAC is nowhere else in system- let it through if we haven't had more than X (5? 10?) submissions in the last 24h from this IP.
4. Same IP, new MAC, MAC is already in system- update the stored IP address of the system entry, allow submission through, overwriting old submission.
5. Different IP, MAC already in system- same as above in 4- change the system entry and then allow submission, replacing old values.
6. And so on- we can "trust" IP address, we can't trust MAC address.
Every month, cull the stats- if we haven't heard from you in two months, you are removed from the counted values. Gather submissions once a week or so. Thus someone that wanted to poison the stats would have to keep up with the submissions from all of their bogus MAC addresses.
-Dan
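The rules above can be sketched as a small in-memory registry. This is one possible reading of the proposal, not server code: the class and constant names are hypothetical, X is set to 10, "two months" is approximated as 60 days, rule 3 is read as counting only first-time MACs per IP, and an unknown MAC from an unknown IP (not spelled out above) is treated like rule 3.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

MAX_NEW_PER_IP = 10               # "X (5? 10?)" from rule 3; 10 is an assumption
WINDOW = timedelta(hours=24)
STALE_AFTER = timedelta(days=60)  # "two months", approximated


@dataclass
class Registry:
    # mac -> (last known ip, last submission time)
    systems: dict = field(default_factory=dict)
    # ip -> times at which a previously unknown MAC was accepted from it
    new_by_ip: dict = field(default_factory=dict)

    def submit(self, ip: str, mac: str, now: datetime) -> bool:
        """Return True if the submission is accepted."""
        if mac in self.systems:
            # Rules 2, 4 and 5: known MAC, update its IP and overwrite its data.
            self.systems[mac] = (ip, now)
            return True
        # Rule 3: unknown MAC, limited per IP within the last 24 hours.
        recent = [t for t in self.new_by_ip.get(ip, []) if now - t < WINDOW]
        accepted = len(recent) < MAX_NEW_PER_IP
        if accepted:
            recent.append(now)
            self.systems[mac] = (ip, now)
        self.new_by_ip[ip] = recent
        return accepted

    def cull(self, now: datetime) -> int:
        """Drop systems not heard from within STALE_AFTER; return how many."""
        stale = [m for m, (_, seen) in self.systems.items() if now - seen > STALE_AFTER]
        for m in stale:
            del self.systems[m]
        return len(stale)
```

Because a known MAC always overwrites its own entry, flooding only pays off for someone who keeps refreshing every bogus MAC before the cull removes it.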
On 11 September 2010 04:16, Daenyth Blank <daenyth+arch@gmail.com> wrote:
On Fri, Sep 10, 2010 at 16:15, Ionuț Bîru <ibiru@archlinux.org> wrote:
i noticed this myself when i tried to submit the data from other machine in my network.
like an idea we can use the UUID from the root partition
Maybe.. Would it make more sense to take a hash of the eth0 mac address? Not sure if that is sensible... I guess the UUID doesn't change that often.
That's right, it only changes when you format the hard disk. But Dan's IP+MAC algorithm above looks pretty good to me.
--
GPG/PGP ID: B42DDCAD
On Fri, Sep 10, 2010 at 5:44 PM, Ray Rashif <schivmeister@gmail.com> wrote:
On 11 September 2010 04:16, Daenyth Blank <daenyth+arch@gmail.com> wrote:
On Fri, Sep 10, 2010 at 16:15, Ionuț Bîru <ibiru@archlinux.org> wrote:
i noticed this myself when i tried to submit the data from other machine in my network.
like an idea we can use the UUID from the root partition
Maybe.. Would it make more sense to take a hash of the eth0 mac address? Not sure if that is sensible... I guess the UUID doesn't change that often.
That's right, it only changes when you format the hard disk. But Dan's IP+MAC algorithm above looks pretty good to me.
We could always just call uuidgen -r on package install and store that in /etc/. That is the easiest thing in the world to then grab later.
-Dan
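As an illustration of the "generate once, store, read back later" idea, the sketch below uses Python's `uuid.uuid4()`, which produces a random UUID much like `uuidgen -r`. The helper name `load_or_create_id` is hypothetical, and the file location is left to the caller because the message does not name a file under /etc/.

```python
import uuid
from pathlib import Path


def load_or_create_id(path: Path) -> str:
    """Return the stored installation ID, creating it on first use (sketch)."""
    if path.exists():
        # Later runs just read back what was generated at install time.
        return path.read_text().strip()
    new_id = str(uuid.uuid4())  # random (version 4) UUID
    path.write_text(new_id + "\n")
    return new_id
```

Unlike a MAC address or a partition UUID, such a value identifies nothing except the pkgstats installation itself, though it is still a user-submitted value.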
On Fri, 10 Sep 2010 15:46:53 -0500, Dan McGee <dpmcgee@gmail.com> wrote:
On Fri, Sep 10, 2010 at 3:27 PM, Pierre Schmitz <pierre@archlinux.de> wrote:
Well, we have discussed all this before. If I don't limit the submission by ip it will be too easy for a single person to flood us with false data making the whole stats pointless. The ip is the only value you cannot easily spoof over internet.
Sure- I'm not saying don't validate IP addresses at all, but the limit should probably be higher than 1 submission per IP in the given time frame.
You are now allowed to do 10 submissions per IP within 24h. Of course there are always corner cases, but I am happy if we catch most use cases here without making it too easy for one single person to screw up the whole stats.
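A limit of 10 submissions per IP within 24 hours can be sketched as a sliding-window counter. This is not the actual server implementation, which the thread does not show; the class name `IpRateLimiter` and the in-memory storage are assumptions for illustration.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


class IpRateLimiter:
    """Allow at most `limit` submissions per IP within a sliding `window`."""

    def __init__(self, limit: int = 10, window: timedelta = timedelta(hours=24)):
        self.limit = limit
        self.window = window
        self._seen = defaultdict(deque)  # ip -> timestamps of accepted submissions

    def allow(self, ip: str, now: datetime) -> bool:
        timestamps = self._seen[ip]
        # Forget submissions that have fallen out of the window.
        while timestamps and now - timestamps[0] >= self.window:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False
        timestamps.append(now)
        return True
```

A sliding window avoids the burst that a fixed daily reset would allow around midnight; a real server would keep the timestamps in its database rather than in memory.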
What about something like this: 1. Submit something "unique" but relatively harmless- first network device MAC address seems reasonable. Root UUID would probably be a bit more work. 2. This suggestion forms a (IP, MAC) combo. If we've seen it before, let it through- what does it matter? We should just update the statistics list for this guy. 3. Same IP, new MAC, and MAC is nowhere else in system- let it through if we haven't had more than X (5? 10?) submissions in the last 24h from this IP. 4. Same IP, new MAC, MAC is already in system- update the stored IP address of the system entry, allow submission through overwriting old submission. 5. Different IP, MAC already in system- same as above in 4- change the system entry and then allow submission, replacing old values. 6. And so on- we can "trust" IP address, we can't trust MAC address.
Thinking about it, it's probably not worth the effort. The MAC address or the UUID of / would just be a user-submitted value. This wouldn't make it any harder for idiots to flood the db. It just increases the workload on our side. We would also collect more data from the users than we need, which might raise privacy concerns.
Every month, cull the stats- if we haven't heard from you in two months, you are removed from the counted values. Gather submissions once a week or so. Thus someone that wanted to poison the stats would have to keep up with the submissions from all of their bogus MAC addresses.
Indeed. The more fair users participate on a regular basis, the better the results are, and some false data from idiots won't matter.
--
Pierre Schmitz, https://users.archlinux.de/~pierre
participants (6)
- Andrea Scarpino
- Daenyth Blank
- Dan McGee
- Ionuț Bîru
- Pierre Schmitz
- Ray Rashif