Re: [arch-general] [arch-dev-public] AUR ToS (aka making AUR user names public)
On 05-03-2017 13:35, Lukas Fleischer wrote:
Hi,
I was recently contacted by a Polish researcher asking for a list of AUR account names. I did not expect this to be controversial but a couple of Trusted Users raised concerns on IRC, so I decided to move this to the public mailing list and discuss the whole topic in generality. I would like to hear more opinions but please read the whole email and give it a second thought before simply bringing up the usual privacy arguments mentioned below.
My original question was: Are we fine with sharing the list of AUR account names (only user names, no real names or email addresses) with a researcher who seems trustworthy and agrees not to share the data in any form other than the resulting anonymized statistics?
In this particular case, we are talking about Dorota Celinska [1] from the University of Warsaw, Faculty of Economic Sciences [2], see [3] for a list of her publications and [4] for a summary of her research project funded recently by the Polish National Science Centre. She needs the list of user names to perform a segmentation analysis, including users who were active on the older AUR releases but do not show any activity on AUR 4. She would also like to use the user names as identifiers to establish connections with other platforms, such as GitHub.
The next question is: Would it make sense to even make this data publicly available? Would it make sense to extend our RPC interface such that one can search for user names? GitHub, for example, already provides such an interface [5]. Let me quickly summarize some arguments for this idea which came up on IRC:
* User names are mostly identifiers. It is questionable whether they can/should be considered personal/private information. Maybe this can only be answered by a lawyer, though.
* The user names of all accounts with any kind of public activity, like uploading a package, filing a request, writing a comment, are public already.
* After logging into the aurweb interface, you can already check whether an account with a given user name exists because the account details page URIs have the form https://aur.archlinux.org/account/$username. This means that for any platform providing a list of user names (such as GitHub), you can "establish connections" with the AUR already.
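To make the last point concrete, here is a minimal sketch of how candidate account URIs could be derived from an external list of user names. Only the URI form https://aur.archlinux.org/account/$username comes from the message above; the function names and the sample names are hypothetical, and the sketch deliberately performs no requests (the page is only visible after logging in).

```python
from urllib.parse import quote

# URI prefix taken from the message above.
AUR_ACCOUNT_BASE = "https://aur.archlinux.org/account/"


def account_url(username):
    """Build the account details URI for one user name (hypothetical helper)."""
    # Percent-encode the name so unusual characters cannot alter the path.
    return AUR_ACCOUNT_BASE + quote(username, safe="")


def candidate_urls(usernames):
    """Map each user name from some other platform to its candidate AUR URI."""
    return {name: account_url(name) for name in usernames}


if __name__ == "__main__":
    # "foo" and "bar" are placeholder names, not real accounts.
    for name, url in candidate_urls(["foo", "bar"]).items():
        print(name, url)
```

Whether such a URI resolves would still have to be checked one name at a time from a logged-in session, which is exactly the per-name probing described above.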
Now the arguments against:
* Principle of data economy: We should not share any kind of information we do not need to share.
* Sharing user names lowers the threshold for sharing other information which is considered more confidential.
* Users can (and should) already use crawlers to fetch the user names. For example, the user names of all package maintainers and comment authors appear on the package details pages. The names of all users filing package requests appear in the mailing list archives etc.
* We do not have ToS so we better not share anything.
I, personally, find the second-to-last argument a very weak one. Telling users to build crawlers scraping and brute-forcing our HTML pages makes life difficult for both them and us. What do you think?
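As a rough illustration of how little is needed to collect already-public maintainer names, the sketch below extracts them from a JSON document shaped like an aurweb RPC info response. This swaps the HTML scraping mentioned above for the RPC purely for brevity. The field names ("results", "Name", "Maintainer") reflect the RPC response format as I understand it and should be treated as an assumption; the package and user names are invented sample data, and nothing is fetched from the network.

```python
import json

# Invented sample data, shaped like an RPC "info" response (assumed format).
SAMPLE_RESPONSE = """
{
  "version": 5,
  "type": "multiinfo",
  "resultcount": 3,
  "results": [
    {"Name": "example-pkg-a", "Maintainer": "alice"},
    {"Name": "example-pkg-b", "Maintainer": "bob"},
    {"Name": "example-pkg-c", "Maintainer": null}
  ]
}
"""


def maintainers(response_text):
    """Return the sorted, de-duplicated maintainer names in a response."""
    data = json.loads(response_text)
    names = set()
    for pkg in data.get("results", []):
        maintainer = pkg.get("Maintainer")
        if maintainer:  # packages without a maintainer carry null here
            names.add(maintainer)
    return sorted(names)


if __name__ == "__main__":
    print(maintainers(SAMPLE_RESPONSE))
```

Note that this only covers users who maintain a package; accounts without any public activity would not show up, which is the gap the researcher's request is about.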
On the other side of the coin, the last argument is a very good one and it brings me to my last point. Independently of the outcome of this discussion, I think we should add some ToS that users need to agree upon when registering. It should contain information on liability and on privacy. Is anybody willing to write a draft? Do we need the support of a lawyer here?
Thank you for your time and have a nice Sunday!
Regards, Lukas
[1] http://coin.wne.uw.edu.pl/dcelinska/en/
[2] https://www.wne.uw.edu.pl/index.php/en/
[3] http://coin.wne.uw.edu.pl/dcelinska/en/pages/publications.html
[4] https://ncn.gov.pl/sites/default/files/listy-rankingowe/2016-03-15/streszcze...
[5] https://developer.github.com/v3/users/
I'd say err on the side of caution and don't share. Even though the usernames are public and easy to find by scraping them from the website/mailing list/etc, handing over the whole database of usernames on a silver platter is a whole different story, and that is what is being asked. Is there any community/website that provides a full list of registered usernames on request?
There is also the question of how useful that data would be. Without any other data such as email, the username list is useless: you have no guarantee that user foo on github is the same person as user foo on the AUR/Wiki/Forum or user foo somewhere else.
In this case I'd also have to agree that sharing usernames lowers the threshold for sharing other information. It also doesn't fit with their stated research goals: only github and projects associated with scraping data from github are mentioned, so why would they want to throw the AUR usernames in the mix?
-- Mauro Santos
Giving away any data is bad, period. I hate this fashion that nowadays every "expert" holding a share is granted access to data that even the NSA isn't getting that easily. Starting to give away such data to "researchers" is evil, let alone that all those "serious" statistics are just bullshit. No exceptions in regards to privacy! An "exception" is a violation of privacy. Why not give away telephone numbers? Anybody could dial each available number anyway, so it doesn't matter to give away the numbers, right? ;)
Isn't Arch BBS already providing list of usernames? In general, though, I'd say follow the principle of least effort. Why just not publish the list of usernames and that's all? This way, new users can easily grep for them and don't need scrapers, and "researchers" can have fun... On Mon, Mar 06, 2017 at 12:36:35AM +0100, Ralf Mardorf wrote:
Giving away any data is bad, period. I hate this fashion that nowadays every "expert" holding a share is granted access to data that even the NSA isn't getting that easily. Starting to give away such data to "researchers" is evil, let alone that all those "serious" statistics are just bullshit. No exceptions in regards to privacy! An "exception" is a violation of privacy. Why not give away telephone numbers? Anybody could dial each available number anyway, so it doesn't matter to give away the numbers, right? ;)
Oh, please. Not the usual NSA crap again. Cheers, -- Leonid Isaev
On Sun, Mar 05, 2017 at 18:14:02 -0700, Leonid Isaev wrote:
Isn't Arch BBS already providing list of usernames?
The BBS's user list is only available to logged-in users. Although that is certainly not an extended privacy measure, it still prevents random people who just "pass by" from extracting the user list.
In general, though, I'd say follow the principle of least effort. Why just not publish the list of usernames and that's all? This way, new users can easily grep for them and don't need scrapers, and "researchers" can have fun...
Because anonymisation: even if one dataset in isolation may look unsuspicious from a privacy POV, if combined with other datasets, it may suddenly reveal information that was not intended to be public.
I admit that a simple one-column list of user nick names may probably not really be joinable with other datasets or -tables in any useful manner, but it is still not always obvious how data can be (ab)used (see also [1]).
I would not give out the user list. Even if there are means for everybody to somehow obtain the data (with enough effort from their side), it is not the same thing as simply handing it out conveniently prepared and formatted.
Best, Tinu
[1] http://archive.wired.com/politics/security/commentary/securitymatters/2007/1...
I was under the impression that the AUR git interface is just one big git repo. Yes, it checks out only the package you clone but the references contain all packages (and commits). Am I mistaken about this?
Regards, -- Leonidas Spyropoulos
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
On Mon, Mar 06, 2017 at 09:44:12AM +0100, Tinu Weber wrote:
Because anonymisation: even if one dataset in isolation may look unsuspicious from a privacy POV, if combined with other datasets, it may suddenly reveal information that was not intended to be public.
I admit that a simple one-column list of user nick names may probably not really be joinable with other datasets or -tables in any useful manner, but it is still not always obvious how data can be (ab)used (see also [1]).
I would not give out the user list. Even if there are means for everybody to somehow obtain the data (with enough effort from their side), it is not the same thing as simply handing it out conveniently prepared and formatted.
See, the rule should be that private information is the one that is manifestly marked so. For example, a password or a secret key is private information which you never ever disclose to anyone. But a username is by definition open. Therefore, if your privacy relies on a web service not disclosing usernames, you haven't considered the threat model carefully enough. What I'm saying is just another example of avoiding security through obscurity: don't rely on a web service not advertising your usernames. If this is an issue, make each username a random string (which defeats the attack [1]).
[1] http://archive.wired.com/politics/security/commentary/securitymatters/2007/1...
Cheers, -- Leonid Isaev
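The "random string" suggestion above could look roughly like the sketch below, which a user would run once per service so that no two accounts share a linkable name. The helper name, the "u" prefix and the default length are all assumptions made for illustration; none of this reflects an AUR requirement or feature.

```python
import secrets


def random_username(nbytes=8):
    """Return an unlinkable user name such as 'u3f9c...' (hypothetical helper)."""
    # token_hex(nbytes) yields 2 * nbytes hexadecimal characters drawn from
    # a cryptographically strong source; the "u" prefix just keeps the name
    # from starting with a digit.
    return "u" + secrets.token_hex(nbytes)


if __name__ == "__main__":
    # Generate one name per service; the service labels are placeholders.
    for service in ("aur", "github"):
        print(service, random_username())
```

The trade-off is that such names carry no reputation across platforms, which is exactly the property that makes them resistant to cross-site matching.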
On Sun, 5 Mar 2017 18:14:02 -0700, Leonid Isaev wrote:
Isn't Arch BBS already providing list of usernames?
In general, though, I'd say follow the principle of least effort. Why just not publish the list of usernames and that's all? This way, new users can easily grep for them and don't need scrapers, and "researchers" can have fun...
Oh, please. Not the usual NSA crap again.
I did not write about the NSA. I only pointed out that even the NSA doesn't get all the data as a gift. Why should a researcher get such data as a gift? You are seemingly already so used to data mining and violated privacy that it's good and natural from your point of view if data is misused, and any concerns are just crap in your opinion. That usernames are used in public, and that a list might even be published already, is different from actively giving the same data away to "researchers" and formally allowing them to use the data. You seem not to understand the principle of privacy. If you don't lock the street door, this does not automatically indicate that you want people to come into your house and take away your property. Btw. what is the aim of the research and how could the research be used or possibly misused? We don't need to care about such questions if data isn't given away as a matter of principle.
On 03/06/2017 10:03 AM, Ralf Mardorf wrote:
I did not write about the NSA. I only pointed out that even the NSA doesn't get all the data as a gift. Why should a researcher get such data as a gift? You are seemingly already so used to data mining and violated privacy that it's good and natural from your point of view if data is misused, and any concerns are just crap in your opinion. That usernames are used in public, and that a list might even be published already, is different from actively giving the same data away to "researchers" and formally allowing them to use the data. You seem not to understand the principle of privacy. If you don't lock the street door, this does not automatically indicate that you want people to come into your house and take away your property. Btw. what is the aim of the research and how could the research be used or possibly misused? We don't need to care about such questions if data isn't given away as a matter of principle.
I also don't think that the list should be published. -- GPG fingerprint: 871F 1047 7DB3 DDED 5FC4 47B2 26C7 E577 EF96 7808
I guess I'll be the devil's advocate. I see no privacy issues in handing over a list of already public information. You could deny it for practical reasons though, if you simply could not be bothered to scrape/export such a list yourself. Denying or allowing won't stop anyone from obtaining the list. If users were concerned about their usernames being public, they shouldn't have submitted them publicly. Public information is public, deal with it and stop being so paranoid, they're gonna get you anyway. ;)
On Mon, 6 Mar 2017 10:52:51 +0100, Henrik Danielsson via arch-general wrote:
I guess I'll be the devil's advocate. I see no privacy issues in handing over a list of already public information. You could deny it for practical reasons though, if you simply could not be bothered to scrape/export such a list yourself. Denying or allowing won't stop anyone from obtaining the list.
If users were concerned about their usernames being public, they shouldn't have submitted them publicly. Public information is public, deal with it and stop being so paranoid, they're gonna get you anyway. ;)
Privacy is a principle. You seem not to understand the difference between giving somebody data with the formal permission to use this data and data that simply is available for everybody, but not explicitly handed over to somebody. Paranoia isn't involved in my concern.
2017-03-06 11:18 GMT+01:00 Ralf Mardorf <silver.bullet@zoho.com>:
Privacy is a principle. You seem not to understand the difference between giving somebody data with the formal permission to use this data and data that simply is available for everybody, but not explicitly handed over to somebody. Paranoia isn't involved in my concern.
My standpoint is that privacy does not apply to this kind of public information, simply because it's not private and by no means sensitive (people freely chose the username and other visible info they posted, no?). Thus, no, I see no difference and really no point in even considering trying to keep such information private. What anyone does with the freely available information posted in the AUR is up to them ("mining" it or handing it over to someone else included), we could not do anything about it anyway, nor would I even care if I was in that list or not, since there seems to be no ToS between the one submitting that information and the one publishing it. Since it was freely submitted without any terms, I can simply not find any restrictions on its usage. Yes, we should have a ToS to at least keep the principle of privacy alive. But let's face it, real privacy online has been dead for long, if it ever existed. If there was a ToS, the situation would perhaps have been different, at least legally. I'm no legal expert of course, but to me it makes perfect sense that if you posted something on the internet, in a very public space, you can have no expectations of keeping any of that information private in any way, nor any information easily associated with. No, I don't see that as a problem, at least not if you never explicitly agreed that information would not be shared. What I really want to keep private I don't post anywhere.
On 06-03-2017 11:20, Henrik Danielsson via arch-general wrote:
My standpoint is that privacy does not apply to this kind of public information, simply because it's not private and by no means sensitive (people freely chose the username and other visible info they posted, no?). Thus, no, I see no difference and really no point in even considering trying to keep such information private.
What anyone does with the freely available information posted in the AUR is up to them ("mining" it or handing it over to someone else included), we could not do anything about it anyway, nor would I even care if I was in that list or not, since there seems to be no ToS between the one submitting that information and the one publishing it. Since it was freely submitted without any terms, I can simply not find any restrictions on its usage.
Yes, we should have a ToS to at least keep the principle of privacy alive. But let's face it, real privacy online has been dead for long, if it ever existed.
If there was a ToS, the situation would perhaps have been different, at least legally. I'm no legal expert of course, but to me it makes perfect sense that if you posted something on the internet, in a very public space, you can have no expectations of keeping any of that information private in any way, nor any information easily associated with. No, I don't see that as a problem, at least not if you never explicitly agreed that information would not be shared. What I really want to keep private I don't post anywhere.
I think the point here is not so much privacy, as I believe everyone recognizes that the information that was asked for (the full list of usernames) is public and can be scraped.
The point here is handing over the full list of usernames on request. Do note that in their research proposal [1] they specifically mention scraping information from github. That information is public and github does have an API to query it, but they still have to scrape it. I suppose that implies github does not hand it over wholesale on request, so why should we? This might be due to their ToS, or they know something we don't.
[1] https://ncn.gov.pl/sites/default/files/listy-rankingowe/2016-03-15/streszcze...
-- Mauro Santos
On Mon, 6 Mar 2017 11:53:34 +0000, Mauro Santos via arch-general wrote:
I think the point here is not so much privacy, as I believe everyone recognizes that the information that was asked for (the full list of usernames) is public
It's not per se forbidden to take a photo of a public location, it's even allowed to take the photo and to publish the photo, if a girl randomly is on that photo, too. It is forbidden to provide a collection of such photos to somebody else, who needs such photos for a porn website. Now "research" isn't "porn", but subtleties could make it hard to decide how to handle something like this. That something is public, doesn't mean that privacy could be ignored.
On 06-03-2017 12:13, Ralf Mardorf wrote:
On Mon, 6 Mar 2017 11:53:34 +0000, Mauro Santos via arch-general wrote:
I think the point here is not so much privacy, as I believe everyone recognizes that the information that was asked for (the full list of usernames) is public
It's not per se forbidden to take a photo of a public location, it's even allowed to take the photo and to publish the photo, if a girl randomly is on that photo, too. It is forbidden to provide a collection of such photos to somebody else, who needs such photos for a porn website. Now "research" isn't "porn", but subtleties could make it hard to decide how to handle something like this. That something is public, doesn't mean that privacy could be ignored.
I'm not saying privacy doesn't matter, it does. The usernames are there for everyone to see, there is no expectation of privacy on that, or the comments on packages. What I feel is the crux of the problem here is handing the list (or database) of users wholesale. I believe you have framed the main question better than I have in one of your replies :) -- Mauro Santos
2017-03-06 12:53 GMT+01:00 Mauro Santos via arch-general < arch-general@archlinux.org>:
I think the point here is not so much privacy, as I believe everyone recognizes that the information that was asked for (the full list of usernames) is public and can be scraped.
The point here is handing over the full list of usernames on request. Do note that in their research proposal[1] they specifically mention scraping information from github. That information is public, github does have an API to query that information, but they still have to scrape it, I suppose that implies github does not hand it over wholesale on request, why should we? This might be due to their ToS or they know something we don't.
It would be rather interesting to see what they could come up with from that correlation. I think, perhaps a bit cynically, the reason github may not hand over that data directly is likely that they don't want to do some of the work of the researchers for them. As you said, the data is there, the format matters less if they're going to massage it into something else later anyway, so why bother with the effort of compiling it on their [github] own time? We could simply deny the AUR username request for the same reason, or no reason at all. Since some people seem uncomfortable about what could be derived from a potential correlation of publicly available data, that's most likely the safest way to go.
On 06-03-2017 12:45, Henrik Danielsson via arch-general wrote:
It would be rather interesting to see what they could come up with from that correlation.
Probably nothing meaningful. As I've said before you have no way of knowing if user foo on github is the same as user foo on the AUR.
I think, perhaps a bit cynically, the reason github may not hand over that data directly is likely that they don't want to do some of the work of the researchers for them. As you said, the data is there, the format matters less if they're going to massage it into something else later anyway, so why bother with the effort of compiling it on their [github] own time?
We could simply deny the AUR username request for the same reason, or no reason at all. Since some people seem uncomfortable about what could be derived from a potential correlation of publicly available data, that's most likely the safest way to go.
-- Mauro Santos
2017-03-06 14:36 GMT+01:00 Mauro Santos via arch-general <arch-general@archlinux.org>:
Probably nothing meaningful. As I've said before you have no way of knowing if user foo on github is the same as user foo on the AUR.
True, but you could make a decent guess based on how many coincidences there are surrounding those names. Relations between names could be interesting even if the people behind them are not the same.
I think, perhaps a bit cynically, the reason github may not hand over that data directly is likely that they don't want to do some of the work of the researchers for them. As you said, the data is there, the format matters less if they're going to massage it into something else later anyway, so why bother with the effort of compiling it on their [github] own time?
We could simply deny the AUR username request for the same reason, or no reason at all. Since some people seem uncomfortable about what could be derived from a potential correlation of publicly available data, that's most likely the safest way to go.
-- Mauro Santos
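The disagreement in the exchange above is about what a name match between two platforms actually tells you. A minimal sketch of the matching step itself is shown below; both name lists are invented, the function name is hypothetical, and folding case before comparing is an assumption about how one might choose to match, not a statement about either platform's rules.

```python
def shared_names(aur_names, github_names):
    """Return the names that appear in both lists, compared case-insensitively."""
    def norm(name):
        # Trim whitespace and fold case so "Bar" and "bar" compare equal.
        return name.strip().lower()

    aur = {norm(n) for n in aur_names}
    github = {norm(n) for n in github_names}
    return sorted(aur & github)


if __name__ == "__main__":
    # Placeholder data only; no real account lists are involved.
    print(shared_names(["foo", "Bar", "baz"], ["bar", "qux", "FOO"]))
```

A name in the result only shows that the same string was registered in both places. As noted in the thread, it does not prove that the same person is behind both accounts, so any conclusion drawn from it stays a guess until other coincidences are taken into account.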
On Mon, 6 Mar 2017 13:45:37 +0100, Henrik Danielsson wrote:
We could simply deny the AUR username request for the same reason, or no reason at all. Since some people seem uncomfortable about what could be derived from a potential correlation of publicly available data, that's most likely the safest way to go.
Even if all users would agree to hand out a username list, why risk a possible issue for some research that seems to gain nothing for the Arch community and, as far as I can see, not even for humankind? To be honest, I can't name a real issue, I could only imagine very abstract issues. I don't understand that research at all. Most likely nothing bad would happen by handing out a list, but to avoid a "Now, why didn't I think of that?"-issue the easiest solution seems to be to reject such requests in general, at least as long as it's not obvious that the research is "good" (whatever this means) for the Arch community and/or humankind or the universe in general.
2017-03-06 15:01 GMT+01:00 Ralf Mardorf <silver.bullet@zoho.com>:
On Mon, 6 Mar 2017 13:45:37 +0100, Henrik Danielsson wrote:
We could simply deny the AUR username request for the same reason, or no reason at all. Since some people seem uncomfortable about what could be derived from a potential correlation of publicly available data, that's most likely the safest way to go.
Even if all users would agree to hand out a username list, why risk a possible issue for some research that seems to gain nothing for the Arch community and, as far as I can see, not even for humankind? To be honest, I can't name a real issue, I could only imagine very abstract issues. I don't understand that research at all. Most likely nothing bad would happen by handing out a list, but to avoid a "Now, why didn't I think of that?"-issue the easiest solution seems to be to reject such requests in general, at least as long as it's not obvious that the research is "good" (whatever this means) for the Arch community and/or humankind or the universe in general.
Well, there's probably a lot of research results we did not know the positive [or any] effects of beforehand. I also doubt we'll find some drastically new improved way of life because of this, but not all research aims for that. Satisfying curiosity would be enough reason for most research IMHO. Learning there is nothing there is also learning.
On Mon, Mar 6, 2017 at 3:01 PM, Ralf Mardorf <silver.bullet@zoho.com> wrote:
Much likely nothing bad would happen by handing out a list, but to avoid a "Now, why didn't I think of that?"-issue the easiest solution seems to reject such requests in general, at least as long as it's not obviously that the research is "good" (what ever this means) for the Arch community and/or human kind or the universe in general.
So you're admitting that you can't come up with a real concern and are opposing just for the sake of opposing. I think it's important to discuss such issues a lot in general, because they improve our reasoning. But just to take away the outcome of your well-meaning pessimism: There could be a white hair in that soup just as there could indeed be a black hair in the opposite colored "Now, why didn't I think of that?" soup of well-meaning optimism you warn against. If you can follow this rather heavily metaphoric dish of thought, it looks like we learned something today, in that case.
cheers! mar77i
On Mon, 6 Mar 2017 15:18:43 +0100, Martin Kühne via arch-general wrote:
On Mon, Mar 6, 2017 at 3:01 PM, Ralf Mardorf <silver.bullet@zoho.com> wrote:
Much likely nothing bad would happen by handing out a list, but to avoid a "Now, why didn't I think of that?"-issue the easiest solution seems to reject such requests in general, at least as long as it's not obviously that the research is "good" (what ever this means) for the Arch community and/or human kind or the universe in general.
So you're admitting that you can't come up with a real concern and are opposing just for the sake of opposing.
Wrong! Protection of privacy is something that requires much thinking and much weighing. Abstract imagination of issues is reason enough to deny such a request, as long as the researcher doesn't plausibly explain the benefit of the research. If somebody wants to hand out the requested data, this person should provide more easy to understand information, that isn't too long to read.
On Mon, 6 Mar 2017 15:07:53 +0100, Henrik Danielsson wrote:
I also doubt we'll find some drastically new improved way of life because of this, but not all research aims for that. Satisfying curiosity would be enough reason for most research IMHO. Learning there is nothing there is also learning.
Curiosity about what? How many equal nicknames were used on AUR and github and what kind of software is related to those nicknames? A researcher is seriously interested in this information? Not in something else? How do you know that this research is about learning something positive? We, the Internet and/or phone home app users already suffer from much misused data. It's reasonable to be sceptical in regards to protection of privacy.
Has got somebody the slightest idea about the aim of this research? "anonymized statistics" and "establish connections" are abstract phrases. Not abstract is that those claims are contradictory, without the need of much abstract concerns or paranoia.
In the end I don't care, since I more or less have given up that nowadays people are interested in really thinking about protection of privacy, hence I'll opt out, I only wanted to point out my doubts. Done.
Regards, Ralf
On Mon, Mar 6, 2017 at 3:53 PM, Ralf Mardorf <silver.bullet@zoho.com> wrote:
Has got somebody the slightest idea about the aim of this research?
good question.
"anonymized statistics" and "establish connections" are abstract phrases. Not abstract is that those claims are contradictory, without the need of much abstract concerns or paranoia.
none of these are crimes. and xxxjavaturtle69xxx actually writes vectorgraphics in java.
In the end I don't care, since I more or less have given up that nowadays people are interested in really thinking about protection of privacy, hence I'll opt out, I only wanted to point out my doubts. Done.
Your approach kind of reminds me about how statistics is a much misunderstood field. It doesn't matter what or how you record statistics, you are always going to get rid of most of the data for the sake of having a general overview, and nothing keeps you from misinterpreting the results you get. That *still* doesn't make the tools completely useless, as it's great for grouping many data points into individual sectors. Of course it's not a simple topic, but you can't fit a one-opt-out-fits-all-opt-outs approach to the problem domain and think you're done?
cheers! mar77i
Hi,
ok a last reply to this topic. Since the usernames are anyway public, there is a reason to ask for a list.
- politeness?
- laziness?
- something related to laws?
- ??
Perhaps the research has nothing to do with AUR and github, but e.g. with a method, maybe an algorithm to "establish connections", perhaps for manipulation purpose? I've got much fantasy about a lot of "good" but as well "evil" reasons.
On Mon, 6 Mar 2017 16:06:30 +0100, Martin Kühne via arch-general wrote:
On Mon, Mar 6, 2017 at 3:53 PM, Ralf Mardorf <silver.bullet@zoho.com> wrote:
Has got somebody the slightest idea about the aim of this research?
good question.
"anonymized statistics" and "establish connections" are abstract phrases. Not abstract is that those claims are contradictory, without the need of much abstract concerns or paranoia.
none of these are crimes. and xxxjavaturtle69xxx actually writes vectorgraphics in java.
Researchers sometimes misuse real records, not to harm those who originally own those records, they just want to test processes that later should be used for something that isn't related to those "test" records.
In the end I don't care, since I more or less have given up that nowadays people are interested in really thinking about protection of privacy, hence I'll opt out, I only wanted to point out my doubts. Done.
Your approach kind of reminds me about how statistics is a much misunderstood field. It doesn't matter what or how you record statistics, you are always going to get rid of most of the data for the sake of having a general overview, and nothing keeps you from misinterpreting the results you get. That *still* doesn't make the tools completely useless, as it's great for grouping many data points into individual sectors. Of course it's not a simple topic, but you can't fit a one-opt-out-fits-all-opt-outs approach to the problem domain and think you're done?
No, as already pointed out, we don't know what this research is for. Who says that the target of the research are statistics? This statistic thingy perhaps is just to test or train something, that later should be used for something completely different. Now I'll use my fantasy for continuing a music project done with Linux.
Regards, Ralf
On 6 Mar 2017 10:52, "Henrik Danielsson via arch-general" <arch-general@archlinux.org> wrote:
I guess I'll be the devil's advocate. I see no privacy issues in handing over a list of already public information. You could deny it for practical reasons though, if you simply could not be bothered to scrape/export such a list yourself. Denying or allowing won't stop anyone from obtaining the list.
I'd say don't. It's not that the information cannot be obtained otherwise, but I believe it makes a legal difference. Also, I don't see any advantage for ArchLinux in handing over this info. If they really want this info to profile AUR users and contributors, they can either compile their own info (using git or scraping), or they could use an opt-in mechanism and ask the users if they want to participate.
I know it's not directly a privacy issue, but I find it scary nonetheless... (especially since they expressed the wish to consolidate the data with other websites such as github).
Kind regards, Guus Snijders
On Mon, Mar 6, 2017 at 11:26 AM, Guus Snijders via arch-general <arch-general@archlinux.org> wrote:
On 6 Mar 2017 10:52, "Henrik Danielsson via arch-general" <arch-general@archlinux.org> wrote:
I guess I'll be the devil's advocate. I see no privacy issues in handing over a list of already public information You could deny it for practical reasons though, if you simply could not be bothered to scrape/export such a list yourself. Denying or allowing won't stop anyone from obtaining the list.
Gaetan's criticism applies to you here, now. please designate paragraphs of text which you reply to.
I know it's not directly a privacy issue, but I find it scary nonetheless... (especially since they expressed the wish to consolidate the data with other websites such as github).
This is exactly the argument I was struggling to come up with. As far as I followed the discussion, this was the first time (I realized?) someone clearly disconnected the argument from the privacy discussion. Put this way, it makes sense to me, too. For the practical implications we'd hand over along, with a note of "do whatever you want with it we don't care". Turns out we do care what someone else does with the data.
cheers! mar77i
I really don't want to make it any easier for someone to spam me based on correlations between account names.
On Mon, Mar 6, 2017 at 4:39 AM, Martin Kühne via arch-general <arch-general@archlinux.org> wrote:
On Mon, Mar 6, 2017 at 11:26 AM, Guus Snijders via arch-general <arch-general@archlinux.org> wrote:
On 6 Mar 2017 10:52, "Henrik Danielsson via arch-general" <arch-general@archlinux.org> wrote:
I guess I'll be the devil's advocate. I see no privacy issues in handing over a list of already public information You could deny it for practical reasons though, if you simply could not be bothered to scrape/export such a list yourself. Denying or allowing won't stop anyone from obtaining the list.
Gaetan's criticism applies to you here, now. please designate paragraphs of text which you reply to.
I know it's not directly a privacy issue, but I find it scary nonetheless... (especially since they expressed the wish to consolidate the data with other websites such as github).
This is exactly for the argument I was struggling to come up with. As far as I followed the discussion, this was the first time (I realized?) someone clearly disconnected the argument from the privacy discussion. Put this way, it makes sense to me, too. For the practical implications we'd hand over along, with a note of "do whatever you want with it we don't care". Turns out we do care what someone else does with the data.
cheers! mar77i
2017-03-06 11:39 GMT+01:00 Martin Kühne via arch-general < arch-general@archlinux.org>:
Gaetan's criticism applies to you here, now. please designate paragraphs of text which you reply to.
I was not replying to anyone in particular. Gaetan? Sorry, you lost me there.
On Mon, Mar 6, 2017 at 12:21 PM, Henrik Danielsson via arch-general <arch-general@archlinux.org> wrote:
I was not replying to anyone in particular. Gaetan? Sorry, you lost me there.
It may not have appeared in the same thread for you, but here we go [0] context, and the mail I was replying to was [1], for which the former applies.
cheers! mar77i
[0] http://www.mail-archive.com/arch-dev-public@archlinux.org/msg25123.html
[1] http://www.mail-archive.com/arch-general@archlinux.org/msg43113.html
2017-03-06 12:58 GMT+01:00 Martin Kühne via arch-general <arch-general@archlinux.org>:
On Mon, Mar 6, 2017 at 12:21 PM, Henrik Danielsson via arch-general <arch-general@archlinux.org> wrote:
I was not replying to anyone in particular. Gaetan? Sorry, you lost me there.
It may not have appeared in the same thread for you, but here we go [0] context, and the mail I was replying to was [1], for which the former applies.
cheers! mar77i
[0] http://www.mail-archive.com/arch-dev-public@archlinux.org/msg25123.html [1] http://www.mail-archive.com/arch-general@archlinux.org/msg43113.html
You are right, those did not show up as a thread for me, or even in my inbox. Thank you for that. I suppose I should have quoted part of the original message, but it was on a list I'm not subscribed to and the quote in Mauro's mail did not render as a quote normally does in Gmail, hence I was not sure it would render correctly if I messed up copying it (HTML entities and all). I've never been able to make heads or tails of what mailing lists actually consider new or continued threads, or even of navigating archives. :(
On Mon, 6 Mar 2017 11:39:33 +0100, Martin Kühne via arch-general wrote:
I know it's not directly a privacy issue, but I find it scary nonetheless... (especially since they expressed the wish to consolidate the data with other websites such as github).
This is exactly for the argument I was struggling to come up with. As far as I followed the discussion, this was the first time (I realized?) someone clearly disconnected the argument from the privacy discussion. Put this way, it makes sense to me, too. For the practical implications we'd hand over along, with a note of "do whatever you want with it we don't care". Turns out we do care what someone else does with the data.
This doesn't disconnect it from the privacy reasoning, this is a privacy issue, too.
On Mon, 6 Mar 2017 11:26:25 +0100, Guus Snijders via arch-general wrote:
It's not that the information cannot be obtained otherwise, but I believe it makes a legal difference.
This is the whole point. It makes a difference to explicitly provide a list under somebody's responsibility, that is isolated from the individual responsibility of the individual user, and by doing this quasi to allow the list to be used for information processing. It's not allowed to download and misuse a photo that is published by a homepage. The photo is accessible for everybody, but not necessarily free for usage. Usernames are readable for everybody, but this doesn't imply that it's allowed to use the usernames for information processing. It might also not be forbidden. It's just important that the responsibility for having a username lies with the user, and the responsibility for collecting and processing the data lies with the "researcher" and not with a third party providing a list.
Mauro Santos via arch-general <arch-general@archlinux.org> writes:
On 05-03-2017 13:35, Lukas Fleischer wrote:
Hi,
I was recently contacted by a Polish researcher asking for a list of AUR account names. I did not expect this to be controversial but a couple of Trusted Users raised concerns on IRC, so I decided to move this to the public mailing list and discuss the whole topic in generality. I would like to hear more opinions but please read the whole email and give it a second thought before simply bringing up the usual privacy arguments mentioned below.
My original question was: Are we fine with sharing the list of AUR account names (only user names, no real names or email addresses) with a researcher that seems trustworthy and agrees to not share the data in any form other than the resulting anonymized statistics?
In this particular case, we are talking about Dorota Celinska [1] from the University of Warsaw, Faculty of Economic Sciences [2], see [3] for a list of her publications and [4] for a summary of her research project funded recently by the Polish National Science Centre. She needs the list of user names to perform a segmentation analysis, including users which were active on the older AUR releases but do not show any activity on AUR 4. She would also like to use the user names as identifiers to establish connections with other platforms, such as GitHub.
The next question is: Would it make sense to even make this data publicly available? Would it make sense to extend our RPC interface such that one can search for users names? GitHub, for example, already provides such an interface [5]. Let me quickly summarize some arguments for this idea which came up on IRC:
* User names are mostly identifiers. It is questionable whether they can/should be considered personal/private information. Maybe this can only be answered by a lawyer, though.
* The user names of all accounts with any kind of public activity, like uploading a package, filing a request, writing a comment, are public already.
* After logging into the aurweb interface, you can already check whether an account with a given user name exists because the account details page URIs have the form https://aur.archlinux.org/account/$username. This means that for any platform providing a list of user names (such as GitHub), you can "establish connections" with the AUR already.
Now the arguments against:
* Principle of data economy: We should not share any kind of information we do not need to share.
* Sharing user names lowers the threshold for sharing other information which is considered more confidential.
* Users can (and should) already use crawlers to fetch the user names. For example, the user names of all package maintainers and comment authors appear on the package details pages. The names of all users filing package requests appear in the mailing list archives etc.
* We do not have ToS so we better not share anything.
I, personally, find the second last argument a very weak one. Telling users to build crawlers scraping and brute-forcing our HTML pages makes life difficult for both them and us. What do you think?
On the other side of the coin, the last argument is a very good one and it brings me to my last point. Independently of the outcome of this discussion, I think we should add some ToS that users need to agree upon when registering. It should contain information on liability and on privacy. Is anybody willing to write a draft? Do we need the support of a lawyer here?
Thank you for your time and have a nice Sunday!
Regards, Lukas
[1] http://coin.wne.uw.edu.pl/dcelinska/en/
[2] https://www.wne.uw.edu.pl/index.php/en/
[3] http://coin.wne.uw.edu.pl/dcelinska/en/pages/publications.html
[4] https://ncn.gov.pl/sites/default/files/listy-rankingowe/2016-03-15/streszcze...
[5] https://developer.github.com/v3/users/
I'd say err on the side of caution and don't share. Even though the usernames are public and easy to find by scraping them from the website/mailing list/etc, handing over the whole database of usernames on a silver platter is a whole different story, which is what is being asked. Is there any community/website that provides a full list of registered usernames on request?
There is also the question of how useful that data would be, without any other data such as email the username list is useless, you have no guarantee that user foo on github is the same person as user foo on the AUR/Wiki/Forum or user foo somewhere else. In this case I'd also have to agree that sharing usernames lowers the threshold for sharing other information.
It also doesn't fit with their stated research goals, only github and projects associated with scraping data from github are mentioned, why would they want to throw the AUR usernames in the mix?
-- Mauro Santos
Hi all, Shall we focus on Lukas's questions?
My original question was: Are we fine with sharing the list of AUR account names (only user names, no real names or email addresses) with a researcher that seems trustworthy and agrees to not share the data in any form other than the resulting anonymized statistics? → The first question: Are we fine with sharing the user names?
The next question is: Would it make sense to even make this data publicly available? Would it make sense to extend our RPC interface such that one can search for users names? GitHub, for example, already provides such an interface [5]. Let me quickly summarize some arguments for this idea which came up on IRC: → The second question: Would it make sense to even make this data publicly available?
I think we should add some ToS that users need to agree upon when registering. It should contain information on liability and on privacy. Is anybody willing to write a draft? Do we need the support of a lawyer here? → The third question: Shall we add some ToS that users need to agree upon when registering?
My opinions:
1. The first question: Are we fine with sharing the user names? I am fine. But I think some agreements should be made before sharing the data.
2. The second question: Would it make sense to even make this data publicly available? No, it is not OK. Please check this wiki [1]. Login name or nickname is Personally identifiable information (PII).
3. The third question: Shall we add some ToS that users need to agree upon when registering? Yes, it is better to have ToS.
[1]: https://www.wikiwand.com/en/Personally_identifiable_information
On 03/06/2017 10:08 PM, YANG Ling via arch-general wrote:
Hi all, Shall we focus on Lukas's questions?
Yes, let's. [skipped - pointlessly quoted and then repeated questions]
My opinions:
1. The first question: Are we fine with sharing the user names? I am fine. But I think some agreements should be made before sharing the data.
There is no need to be fine or not, the user names are *already* public (with the exception of people who have never uploaded a package, left a comment, filed a package request, or indeed visibly interacted with the AUR in any way).
2. The second question: Would it make sense to even make this data publicly available? No, it is not OK. Please check this wiki [1]. Login name or nickname is Personally identifiable information (PII).
Okay... firstly, thanks for the strange Wikipedia proxy.... Stating a tautology does not advance this discussion. No one thought for a moment that usernames weren't somehow "personally identifying information". Lukas elaborated upon this question, by providing actual arguments for and against. By ignoring 90% of what he said, you are stripping the discussion of most meaningful context, and replacing it with some vague buzzwords.
3. The third question: Shall we add some ToS that users need to agree upon when registering? Yes, it is better to have ToS.
This wasn't even the question. Lukas said we should have a ToS, and he *asked* if anyone was willing to draft one.
...
I really don't understand why people seem to have a paranoia issue with other people having an efficient interface to data that is already there. Researchers and Peeping Toms can already find out all this information by hitting the AUR server a lot and scraping HTML responses, offering the *same* data with less overhead can only serve to ease server congestion (on "our" end) and *time expended* reinventing the username list (on "their" end).
Do we wish to penalize all researchers for the evil habit of extracting personally identifiable information, by making them slog through the process of compiling their information? Knowing full well that it won't actually stop them (for good or for ill)?
Do we even owe anything to the relevant users? Since there is no ToS, an argument could be made that we all agreed to share whatever information we have in fact shared, without asking for qualifications about what the Arch Linux project intended to *do* with our usernames etc. (The usual IANAL applies.)
tl;dr Let us emulate the forums, and provide a username list only accessible to logged-in AUR users.
-- Eli Schwartz
On Mon, 6 Mar 2017 22:46:14 -0500, Eli Schwartz via arch-general wrote:
Let us emulate the forums, and provide a username list only accessible to logged-in AUR users.
So you recommend that AUR should deviate from the Arch related mailing lists. Note, mailman mailing list could be set up to "The subscribers list is only available to the list members" as done for https://lists.archlinux.org/listinfo/aur-general and then the members could decide on their own, if they are visible by this list [1]. However, https://lists.archlinux.org/listinfo/arch-general is set up to "The subscribers list is only available to the list administrator".
So it was always the default and still is the default, not to provide such lists. There are sane reasons why those options are available, but we are at a point where nobody cares anymore about privacy and you recommend that providing such a list is appropriate. For what reason? Why not keeping what always was and still is default for Arch, simply to respect privacy?
[1] "Conceal yourself from subscriber list? When someone views the list membership, your email address is normally shown (in an obscured fashion to thwart spam harvesters). If you do not want your email address to show up on this membership roster at all, select Yes for this option."
On Tue, 2017-03-07 at 06:48 +0100, Ralf Mardorf wrote:
On Mon, 6 Mar 2017 22:46:14 -0500, Eli Schwartz via arch-general wrote:
Let us emulate the forums, and provide a username list only accessible to logged-in AUR users.
So you recommend that AUR should deviate from the Arch related mailing lists. Note, mailman mailing list could be set up to "The subscribers list is only available to the list members" as done for https://lists.archlinux.org/listinfo/aur-general and then the members could decide on their own, if they are visible by this list [1]. However, https://lists.archlinux.org/listinfo/arch-general is set up to "The subscribers list is only available to the list administrator".
So it was always the default and still is the default, not to provide such lists. There are sane reasons why those options are available, but we are at a point were nobody cares anymore about privacy and you recommend that providing such a list is appropriate. For what reason? Why not keeping what always was and still is default for Arch, simply to respect privacy?
[1] "Conceal yourself from subscriber list?
When someone views the list membership, your email address is normally shown (in an obscured fashion to thwart spam harvesters). If you do not want your email address to show up on this membership roster at all, select Yes for this option."
PS: And btw. stop calling people who care about privacy "paranoid". I for example decided not to hide my membership and email address. Another question: Are you willing to become the one responsible for providing such a list in the legal sense?
Eli Schwartz via arch-general <arch-general@archlinux.org> writes:
On 03/06/2017 10:08 PM, YANG Ling via arch-general wrote:
Hi all, Shall we focus on Lukas's questions?
Yes, let's.
[skipped - pointlessly quoted and then repeated questions]
Sorry, I'm not familiar with the rules here. I thought it is necessary to keep all original text.
2. The second question: Would it make sense to even make this data publicly available? No, it is not OK. Please check this wiki [1]. Login name or nickname is Personally identifiable information (PII).
Okay... firstly, thanks for the strange Wikipedia proxy....
Oops, I forgot to paste the original wiki link. Wikiwand is a tool which can beautify Wikipedia page. If that bothers you, here is the original link: https://en.wikipedia.org/wiki/Personally_identifiable_information
tl;dr Let us emulate the forums, and provide a username list only accessible to logged-in AUR users.
Actually I am not *strongly* against it. I just feel it is not necessary to do it. Here is my logic:
If one thing can make Archlinux better, then we do it.
If one thing can make Archlinux worse, then we don't do it.
If we cannot tell whether one thing makes Archlinux better or worse, then it is not necessary to do it.
In this case, what does Archlinux or its community gain from sharing usernames with researchers? I cannot see merits. Well, I am not an Archlinux expert, nor a Trusted User. Maybe you can see merits which are not obvious to me. If most Trusted Users feel it is OK to share usernames, I am fine with it. Don't get me wrong. I just simply hope Archlinux becomes better and better.
-- Allen Yang
This discussion is pointless without legal advice. Without it disclosing user information (even if it is public) does not seem like such a good idea.
On 8 March 2017 at 20:57, Neven Sajko <nsajko@gmail.com> wrote:
This discussion is pointless without legal advice. Without it disclosing user information (even if it is public) does not seem like such a good idea.
Not that I advocate paying a lawyer just for this issue, it would be simpler to let her scrape AUR ;) ... But there probably should be some TOS ...
On Wed, 8 Mar 2017 21:02:24 +0100 Neven Sajko via arch-general <arch-general@archlinux.org> wrote:
On 8 March 2017 at 20:57, Neven Sajko <nsajko@gmail.com> wrote:
This discussion is pointless without legal advice. Without it disclosing user information (even if it is public) does not seem like such a good idea.
Not that I advocate paying a lawyer just for this issue, it would be simpler to let her scrape AUR ;)
... But there probably should be some TOS ...
IMO, why are we even discussing this. If she wants to do research, do it, set it up and scrape. Why should we compile a list and send it..? :S
-- Joakim
On Wed, Mar 08, 2017 at 09:02:24PM +0100, Neven Sajko via arch-general wrote:
... But there probably should be some TOS ...
Why? Would a ToS be a legally binding document? If yes, it will constrain Arch, which is not good. If no, then it's just a meaningless text. I understand that companies have ToS because they want to cover their back legally, but Arch is different in this regard...
Cheers,
-- Leonid Isaev
participants (14)
- Bennett Piater
- Eli Schwartz
- Guus Snijders
- Henrik Danielsson
- Joakim Hernberg
- Leonid Isaev
- Leonidas Spyropoulos
- Martin Kühne
- Mauro Santos
- mike lojkovic
- Neven Sajko
- Ralf Mardorf
- Tinu Weber
- YANG Ling