[aur-dev] [PATCH] Make external links in comments clickable (FS#20137).
This is a somewhat hacky patch to make links in AUR comments clickable (fixes FS#20137 [1]). Large parts of this code are taken from the DokuWiki plugin that is also used in Flyspray. I haven't had time to test it extensively, so I'd suggest running some more tests before this is committed.

[1] https://bugs.archlinux.org/task/20137
---
 web/lib/aur.inc               | 45 +++++++++++++++++++++++++++++++++++++++++
 web/template/pkg_comments.php |  2 +-
 2 files changed, 46 insertions(+), 1 deletions(-)

diff --git a/web/lib/aur.inc b/web/lib/aur.inc
index bd69c4c..b0cfdc8 100644
--- a/web/lib/aur.inc
+++ b/web/lib/aur.inc
@@ -494,3 +494,48 @@ function salted_hash($passwd, $salt)
 	}
 	return md5($salt . $passwd);
 }
+
+function parse_link($matches)
+{
+	$name = $url = $matches[0];
+
+	if (substr($url, 0, 3) == 'ftp' && (substr($url, 0, 6) != 'ftp://')) {
+		$url = 'ftp://' . $url;
+	}
+	elseif (substr($url, 0, 3) == 'www') {
+		$url = 'http://' . $url;
+	}
+
+	$url = str_replace('&amp;', '&', $url);
+	$url = str_replace('&', '&amp;', $url);
+	$url = strtr($url, array('>' => '%3E', '<' => '%3C', '"' => '%22'));
+
+	return '<a href="' . $url . '">' . $name . '</a>';
+}
+
+function parse_comment($comment)
+{
+	$schemes = array('http', 'https', 'ftp');
+	$ltrs = '\w';
+	$gunk = '\/\#~:.?+=&%@!\-';
+	$punc = '.:?\-;,';
+	$host = $ltrs . $punc;
+	$any  = $ltrs . $gunk . $punc;
+
+	$patterns = array();
+
+	foreach ($schemes as $scheme) {
+		$patterns[] = '(\b(?i)' . $scheme . '(?-i):\/\/[' . $any . ']+?(?=[' . $punc . ']*[^' . $any . ']))';
+	}
+
+	$patterns[] = '(\b(?i)www?(?-i)\.[' . $host . ']+?\.[' . $host . ']+?[' . $any . ']+?(?=[' . $punc . ']*[^' . $any . ']))';
+	$patterns[] = '(\b(?i)ftp?(?-i)\.[' . $host . ']+?\.[' . $host . ']+?[' . $any . ']+?(?=[' . $punc . ']*[^' . $any . ']))';
+
+	$regex = '/' . implode('|', $patterns) . '/msS';
+
+	$comment = htmlspecialchars($comment);
+	$comment = preg_replace_callback($regex, 'parse_link', $comment . "\n");
+	$comment = nl2br($comment);
+
+	return $comment;
+}
diff --git a/web/template/pkg_comments.php b/web/template/pkg_comments.php
index 02171a0..2ca9bf0 100644
--- a/web/template/pkg_comments.php
+++ b/web/template/pkg_comments.php
@@ -20,7 +20,7 @@ while (list($indx, $carr) = each($comments)) { ?>
 ?></div>
 <blockquote class="comment-body">
 <div>
-<?php echo nl2br(htmlspecialchars($carr['Comments'])) ?>
+<?php echo parse_comment($carr['Comments']) ?>
 </div>
 </blockquote>
 <?php
--
1.7.3
On 09/30/2010 05:22 PM, Lukas Fleischer wrote:
What about the occurrences of "&amp;(html-entity-code-here);" that you produced the line before?
I am not that experienced with PHP, but this looks like the $patterns array got replaced instead of extended.
Won't this render the next instruction useless if there are html-characters in a link?
Generally I would not make plain hostnames ("www.foo.tld") clickable. If people are not able to provide proper URLs, they have a serious problem. (There is also the technical argument that the hostname is not a good indicator of the kind of service the host provides.) Regards, PyroPeter -- freenode/pyropeter "12:50 - Ich drücke Return."
On Thu, Sep 30, 2010 at 06:18:24PM +0200, PyroPeter wrote:
Nothing? Any occurrence of an HTML entity code is correctly encoded: its ampersand is replaced by "&amp;". People shouldn't be able to insert HTML entities into comments manually. As I realized just now, the first line is actually superfluous, since ampersands have already been replaced by htmlspecialchars() by the time this line is executed (I didn't check that before; this part of the code was extracted from the DokuWiki plugin).
Nope, it doesn't. Check [1].
Nope. Links need to be escaped as well. I'm not sure what happens if a link contains quotes or "<"/">". This shouldn't happen too often, though.
Why not? What if you explicitly want to link to a project's home page? It'll also only convert hostnames if they start with a "www" or "ftp" subdomain, so comments referring to domains in other ways won't be converted. [1] http://www.php.net/manual/de/language.types.array.php#language.types.array.s...
On 09/30/2010 06:38 PM, Lukas Fleischer wrote:
Well, but you are encoding existing entities that are not "&amp;" as "&amp;foo;". See the example below.
I see, "$var[] = foo" creates the array $var if necessary and appends foo.
Imo, you should split the message at the link boundaries ("foo ", "http://foo.bar.tld", " baz"). Then you should encode the HTML entities in all elements, wrap the links in <a>'s, and then join everything together again.

I cannot think of a way to combine <a> tags with proper encoding of the user input using just "normal" string functions like replace, or regexes. I would be happy if you could prove the opposite, and will help by providing you with input that breaks your system.

== example 1 ==

input: "foo http://foo.tld/iLikeToUseApersands/foo&bar.html baz"

If I am not mistaken, $regex would be
"/http://foo.tld/iLikeToUseApersands/foo&bar.html/msS"
(are the "/" correctly escaped? I will assume they are). Then, $regex would be:
"/http:\/\/foo\.tld\/iLikeToUseApersands\/foo&bar\.html/msS"

$comment would be set by htmlspecialchars() to:
"foo http://foo.tld/iLikeToUseApersands/foo&amp;bar.html baz"

=> preg_replace_callback() would not match, as the "&" got replaced by "&amp;".
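The mismatch in example 1 can be reproduced in isolation. Below is a minimal standalone sketch (the URL and the pattern are made up for illustration, this is not the patch's actual regex): after htmlspecialchars(), a literal "&" has become "&amp;", so a URL pattern whose character class lacks ";" no longer matches the full URL.

```php
<?php
// Reproduce the mismatch from example 1: htmlspecialchars() turns "&"
// into "&amp;", so a pattern built for the raw comment text matches
// something different on the escaped text.
$comment = 'foo http://foo.tld/a&b.html baz';
$escaped = htmlspecialchars($comment);
// $escaped is now: 'foo http://foo.tld/a&amp;b.html baz'

// A character class with "&" but without ";" stops inside the entity:
preg_match('/\bhttp:\/\/[\w\/.&-]+/', $escaped, $m);
$match = $m[0]; // 'http://foo.tld/a&amp' -- the entity is cut in half
```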
You can also link to a home page using a valid URL. The additional "feature" may be nice, but it makes the code more complex. It also trains users to omit the "http://", and produces more work for devs, as they all now have to parse this invalid hostname+path stuff. Unrelated: You seem to accept only a-zA-Z in hostnames? Or does PHP's \w include 0-9 and language-dependent letters? What about underscores? Why does the <a>'s content only include the path of the URL? Regards, PyroPeter -- freenode/pyropeter "12:50 - Ich drücke Return."
On Thu, Sep 30, 2010 at 08:56:56PM +0200, PyroPeter wrote:
Well, but you are encoding existing entities that are not "&amp;" as "&amp;foo;". See the example below.
Yep, and that's how it's supposed to be. There shouldn't be any entities in comments that users put there and that are not encoded.
I see, "$var[] = foo" creates the array $var if necessary and appends foo.
Correct.
Yes... That would be cleaner, but also way more complicated to implement, and it would require a huge amount of code just for making links clickable.
Why shouldn't it work? preg_replace_callback() still matches if the URL contains a semicolon. This will be parsed and a valid link will be output (tested with the current Git version with the patch applied).
Hm, that's a question of taste. We'll let Loui decide :p
"\w" in perl compatible regex includes all alphanumeric characters plus the underscore ("_").
Why does the <a>'s content only include the path of the URL?
It doesn't. The "<a></a>"'s content contains exactly what the user typed (with special chars converted by htmlspecialchars()). Please don't just assume things but test your examples using a current GIT checkout with the patch applied in future.
On 10/01/2010 01:05 AM, Lukas Fleischer wrote:
Please don't just assume things but test your examples using a current GIT checkout with the patch applied in future.
I did not mean to offend you, and after applying the patch (which I should have done before sending the mails, you are right), your code in fact seems to work a lot better than I thought.

While testing, I found a bug. Post this URL:

http://foo.bar/<><>;

It seems to trigger two bugs at once: first, the regex does not match the whole URL, and second, the href is escaped twice.

Regards, PyroPeter -- freenode/pyropeter "12:50 - Ich drücke Return."
On Fri, Oct 01, 2010 at 02:15:41PM +0200, PyroPeter wrote:
I didn't feel offended in any way, but reporting bugs that don't exist is just counterproductive and a waste of time.
I already said in another mail [1] that there might be problems if the URL contains quotes or less-than/greater-than symbols. This can be fixed by removing the first str_replace() (which I also proposed in the same mail), or by repeating the second str_replace() for '"', "'", "<" and ">" (which might even be better from a security perspective). However, I don't think such URLs will be a common use case.

The second "bug" is expected behaviour, since punctuation marks at the end of URLs shouldn't be included in the URL itself (imagine someone putting a link at the end of a sentence). This is also how DokuWiki and Flyspray behave. If there really is a URL requiring a punctuation mark at its end (which there shouldn't be at all), this can be remarked upon in the comment itself.

[1] http://mailman.archlinux.org/pipermail/aur-dev/2010-September/001263.html
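The trailing-punctuation behaviour comes from the lookahead in the patch: the URL is matched lazily and only has to end where optional punctuation is followed by a non-URL character. A reduced standalone sketch of that mechanism (note the appended "\n", which the patch also adds so the lookahead can succeed when a URL ends the comment):

```php
<?php
// Reduced version of the patch's lookahead: trailing punctuation after
// a URL is left out of the match, so a sentence-final "." survives.
$punc = '.:?\-;,';
$any  = '\w\/\#~:.?+=&%@!\-' . $punc;
$re   = '/\bhttp:\/\/[' . $any . ']+?(?=[' . $punc . ']*[^' . $any . '])/';

// The patch appends "\n" so the lookahead has a non-URL character to
// anchor on even at the very end of a comment.
preg_match($re, "See http://example.org/page.\n", $m);
$url = $m[0]; // 'http://example.org/page' -- without the trailing dot
```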
On 09/30/2010 05:22 PM, Lukas Fleischer wrote:
I fixed the "<"/">" bugs and changed the indentation to tabs.

I also changed the regex to one that accepts everything that starts with one of 'http', 'https' or 'ftp' followed by a colon ":", contains no whitespace and ends with a letter (\w) or a slash "/". It also parses hostnames starting with "www." and ending in a 2 to 5 letter long TLD consisting of only a-z.

I have tested it with the comments at [0].

Regards, PyroPeter

[0] http://aur.keks.selfip.org/packages.php?ID=298
---
diff --git a/web/lib/aur.inc b/web/lib/aur.inc
index bd69c4c..7465fac 100644
--- a/web/lib/aur.inc
+++ b/web/lib/aur.inc
@@ -494,3 +494,24 @@ function salted_hash($passwd, $salt)
 	}
 	return md5($salt . $passwd);
 }
+
+function parse_link($matches)
+{
+	$name = $matches[0];
+	$url = preg_replace('/^www\./', 'http://www.', $matches[0]);
+	return '<a href="' . $url . '">' . $name . '</a>';
+}
+
+function parse_comment($comment)
+{
+	$regex = '/' . str_replace('/', '\/', implode('|', array(
+		'\b(http|https|ftp):[^\s]*[\w\d/]',
+		'\bwww\.[^\s/]+\.[a-z]{2,5}'
+	))) . '/S';
+
+	$comment = htmlspecialchars($comment);
+	$comment = preg_replace_callback($regex, 'parse_link', $comment);
+	$comment = nl2br($comment);
+
+	return $comment;
+}
diff --git a/web/template/pkg_comments.php b/web/template/pkg_comments.php
index 02171a0..2ca9bf0 100644
--- a/web/template/pkg_comments.php
+++ b/web/template/pkg_comments.php
@@ -20,7 +20,7 @@ while (list($indx, $carr) = each($comments)) { ?>
 ?></div>
 <blockquote class="comment-body">
 <div>
-<?php echo nl2br(htmlspecialchars($carr['Comments'])) ?>
+<?php echo parse_comment($carr['Comments']) ?>
 </div>
 </blockquote>
 <?php
--
freenode/pyropeter "12:50 - Ich drücke Return."
On Fri, Oct 01, 2010 at 04:59:22PM +0200, PyroPeter wrote:
This won't match URLs like "https://aur.archlinux.org/packages.php?O=0&K=", and an ampersand at the end of a URL won't be converted correctly :/ I'll try to implement it in a more proper way in the next few days. Maybe I'll actually go with splitting comments at link boundaries, as you suggested before... :)
On 10/01/2010 05:52 PM, Lukas Fleischer wrote:
Well, that's the problem: which characters should belong to the end of the URL, and which should not? There could also be cases in which punctuation belongs to the URL.

If punctuation is parsed as not belonging to the URL, there would be no way to post a working link to certain URLs. If punctuation is parsed as part of the URL, one could insert a space between the URL and the punctuation that should not belong to it. One should also consider that inserting a URL into a sentence looks horrible and is normally not done (by me, at least).

About splitting at boundaries: contrary to what I said before, using regular expressions seems to be a valid and efficient way. (I thought you would have to escape tag content and attributes in different ways (percent-encoding vs. HTML entities). After reading the HTML4 specification I realized this is not the case, as content and attributes are both escaped using HTML entities.)

Regards, PyroPeter -- freenode/pyropeter "12:50 - Ich drücke Return."
On Fri, Oct 01, 2010 at 06:23:06PM +0200, PyroPeter wrote:
Using regular expressions is an efficient way, but they should be applied before htmlspecialchars() or anything similar. E.g. we could use preg_match() or preg_match_all() with PREG_OFFSET_CAPTURE to get the positions of all links, then call a function that converts the links and escapes the strings as necessary, and convert the parts that don't contain any links separately using htmlspecialchars().
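A minimal sketch of what that could look like (the function name and the URL pattern here are placeholders for illustration, not the code that was eventually committed): find the link offsets in the raw comment first, then escape link and non-link segments separately.

```php
<?php
// Sketch of the PREG_OFFSET_CAPTURE approach described above:
// locate links in the *raw* comment, then escape each segment on its own.
function render_comment_sketch($comment)
{
	// deliberately simple URL pattern, for illustration only
	preg_match_all('/\b(?:https?|ftp):\/\/\S+/', $comment, $m,
		PREG_OFFSET_CAPTURE);

	$html = '';
	$pos = 0;
	foreach ($m[0] as $match) {
		list($url, $offset) = $match;
		// plain text before the link
		$html .= htmlspecialchars(substr($comment, $pos, $offset - $pos));
		// the link itself, escaped once for both href and content
		$e = htmlspecialchars($url);
		$html .= '<a href="' . $e . '">' . $e . '</a>';
		$pos = $offset + strlen($url);
	}
	// remaining plain text after the last link
	$html .= htmlspecialchars(substr($comment, $pos));
	return $html;
}

$out = render_comment_sketch('see http://example.org/?a=1&b=2 please');
```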
On Fri, Oct 01, 2010 at 06:23:06PM +0200, PyroPeter wrote:
I didn't read the whole thread, but as far as I understand, you're looking for a proper way to correctly find URLs in comments. John Gruber's regex seems quite right for this: http://daringfireball.net/2010/07/improved_regex_for_matching_urls

Does this help?

Jan-Erik (badboy_)
Comments are now split at link boundaries and the links are converted separately. I find this to be a much cleaner way than re-converting comments that have already been converted using htmlspecialchars(). This also doesn't require any callback procedure.
---
 web/lib/aur.inc               | 24 ++++++++++++++++++++++++
 web/template/pkg_comments.php |  2 +-
 2 files changed, 25 insertions(+), 1 deletions(-)

diff --git a/web/lib/aur.inc b/web/lib/aur.inc
index bd69c4c..a6292ca 100644
--- a/web/lib/aur.inc
+++ b/web/lib/aur.inc
@@ -494,3 +494,27 @@ function salted_hash($passwd, $salt)
 	}
 	return md5($salt . $passwd);
 }
+
+function parse_comment($comment)
+{
+	$url_pattern = '/(\b(?:https?|ftp):\/\/[\w\/\#~:.?+=&%@!\-;,]+?' .
+		'(?=[.:?\-;,]*(?:[^\w\/\#~:.?+=&%@!\-;,]|$)))/iS';
+
+	$matches = preg_split($url_pattern, $comment, -1,
+		PREG_SPLIT_DELIM_CAPTURE);
+
+	$html = '';
+	for ($i = 0; $i < count($matches); $i++) {
+		if ($i % 2) {
+			# convert links
+			$html .= '<a href="' . htmlspecialchars($matches[$i]) .
+				'">' . htmlspecialchars($matches[$i]) . '</a>';
+		}
+		else {
+			# convert everything else
+			$html .= nl2br(htmlspecialchars($matches[$i]));
+		}
+	}
+
+	return $html;
+}
diff --git a/web/template/pkg_comments.php b/web/template/pkg_comments.php
index 02171a0..2ca9bf0 100644
--- a/web/template/pkg_comments.php
+++ b/web/template/pkg_comments.php
@@ -20,7 +20,7 @@ while (list($indx, $carr) = each($comments)) { ?>
 ?></div>
 <blockquote class="comment-body">
 <div>
-<?php echo nl2br(htmlspecialchars($carr['Comments'])) ?>
+<?php echo parse_comment($carr['Comments']) ?>
 </div>
 </blockquote>
 <?php
--
1.7.3.1
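The alternation trick in the final patch relies on preg_split() with PREG_SPLIT_DELIM_CAPTURE: because the whole URL pattern is wrapped in a capturing group, the resulting array alternates between plain text (even indices) and captured links (odd indices). A self-contained illustration of the technique, using a deliberately simplified URL pattern rather than the patch's own:

```php
<?php
// Illustrate the split-at-link-boundaries technique from the patch.
$url_pattern = '/(\bhttps?:\/\/[^\s]+[\w\/])/';
$comment = 'See http://example.org/x?a=1&b=2 for details.';
$parts = preg_split($url_pattern, $comment, -1, PREG_SPLIT_DELIM_CAPTURE);
// $parts: ['See ', 'http://example.org/x?a=1&b=2', ' for details.']

$html = '';
foreach ($parts as $i => $part) {
	if ($i % 2) {   // odd index: a captured link
		$e = htmlspecialchars($part);
		$html .= '<a href="' . $e . '">' . $e . '</a>';
	} else {        // even index: plain text
		$html .= htmlspecialchars($part);
	}
}
```

Escaping each half separately is what makes this cleaner than the earlier patches: the URL is never escaped twice, and no entity produced by htmlspecialchars() ever reaches the URL regex.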