[aur-dev] Safe and relatively reliable PKGBUILD parser.
Hi, There was no response on the pacman-dev list but someone here might find this potentially useful: http://mailman.archlinux.org/pipermail/pacman-dev/2010-January/010322.html It's written in Perl but it could easily be adapted to Python. It handles Bash variable interpolation and string substitutions so far. I'll add support for other PKGBUILD "tricks"as I come across them, e.g. setting dependencies by architecture. The post linked above includes simplified JSON output from the split kernel26 package. Regards, Xyne
On Fri, Jan 8, 2010 at 6:06 PM, Xyne <xyne@archlinux.ca> wrote:
Hi,
There was no response on the pacman-dev list but someone here might find this potentially useful:
http://mailman.archlinux.org/pipermail/pacman-dev/2010-January/010322.html
It's written in Perl but it could easily be adapted to Python. It handles Bash variable interpolation and string substitutions so far. I'll add support for other PKGBUILD "tricks"as I come across them, e.g. setting dependencies by architecture. The post linked above includes simplified JSON output from the split kernel26 package.
Regards, Xyne
What was the problem with that from Sebastian which was discussed earlier on the mailing lists, IRCs ? How does it know more ? Best Regards, Laszlo Papp
What was the problem with that from Sebastian which was discussed earlier on the mailing lists, IRCs ? How does it know more ?
I don't know. I wrote this because I needed a PKGBUILD parser in Perl for Bauerbill. Maybe it's better, maybe it's worse. I posted it here in case someone finds it useful per se, or wishes to take any of the ideas from it and use them to iimprove other parsers. Hmmm, after briefly reviewing the messages, I can mention that my parser: * doesn't depent on Yacc/Lex * supports split packages already * handles multiline assignments * supports interpolation and string substitutions
On 09/01/2010, at 2:50 AM, Xyne wrote:
What was the problem with that from Sebastian which was discussed earlier on the mailing lists, IRCs ? How does it know more ?
I don't know. I wrote this because I needed a PKGBUILD parser in Perl for Bauerbill. Maybe it's better, maybe it's worse. I posted it here in case someone finds it useful per se, or wishes to take any of the ideas from it and use them to iimprove other parsers.
It is quite a clever idea. I haven't seen this approach before. I haven't looked at it thoroughly, but it looks like you're simply sourcing the PKGBUILD with some trickery not to execute the code. Why then the need for further parsing? Does `set` produce "raw" bash, e.g. 'source=("https://localhost/$pkgname.tgz")'? It seems like bash should be able to do it itself. If that were the case, the parser would be extremely reliable (definitely more so than mine). There are still some "safety" issues involved, although maybe not for your purposes. One major thing is infinite loops - there's no way to break them. I'm sure this parser will be very useful when such things aren't an issue.
Hmmm, after briefly reviewing the messages, I can mention that my parser: * doesn't depent on Yacc/Lex * supports split packages already * handles multiline assignments * supports interpolation and string substitutions
For the record pkgparse does support split packages and word substitutions (though it's primitive atm, i.e. only $foo works, modifiers like ${foo##bar} don't). The major problem is with multiline assignments, but once that get's fixed, most PKGBUILDs should be parse- able. It probably won't depend on yacc/lex anymore either, but it will depend on Lemon/Ragel, as that's the direction it seems to be going :P. It's a compile-time dependency though, so it's not really a reason not to use it. To use it in perl you'd have to make perl bindings, which would require compilation anyway.
It is quite a clever idea. I haven't seen this approach before. I haven't looked at it thoroughly, but it looks like you're simply sourcing the PKGBUILD with some trickery not to execute the code. Why then the need for further parsing? Does `set` produce "raw" bash, e.g. 'source=("https://localhost/$pkgname.tgz")'? It seems like bash should be able to do it itself. If that were the case, the parser would be extremely reliable (definitely more so than mine). There are still some "safety" issues involved, although maybe not for your purposes. One major thing is infinite loops - there's no way to break them. I'm sure this parser will be very useful when such things aren't an issue.
You haven't fully understood how it works so I hope you don't mind if I try to explain it again. I first check the PKGBUILD with "/bin/bash -n PKGBUILD". If this command exits without error then the PKGBUILD contains valid syntax, most importantly it does not contain extra closing brackets ("}"). This lets me wrap the entire PKGBUILD in a function, e.g. pkgbuild () { <PKGBUILD> } I can then source the file with Bash without executing any code. The previous check with "bash -n" guarantees that the PKGBUILD can not escape the wrapping function. Because all code is inside a function, sourcing the file does not execute any code at all. Bash simply parses the file and stores the code itself in the "pkgbuild" function, which itself contains other variables and functions (e.g. package_foo, build). Because the code has not been executed, the variables have not been expanded/interpolated and thus still contain things such s "http://example.com/$pkgname-$pkgver.tar", which is why it must still be intepolated by the parser. The advantage of this method is that "set" will print out the "pkgbuild" function and its contents in a canonical form, e.g. all assignments to a variable are on a single line, if/then/else statements follow a single format, etc. This makes it possible to easily parse the assignments themselves, in the order that they occur, without haing to consider all variations of valid whitespace in statements. The parser simply needs to recognize Bash syntax for things such as string substitutions, but this is a relatively limted set so it is not difficult to handle all such cases. The output of "set" also guarantees that you have a representation of all variable assignments (in sequential order, and within their local environment) so you have all the information that you need to interpolate them. You could even handle command output if you wish, using a command white-list to make sure that no trickery is used to run malicious code. Let me repeat that my method does not run any code in the PKGBUILD. I've tested this by including an infinite loop at the top of the file and it was not executed. I actually believe that this method provides a perfectly safe and potentially very reliable method of retrieving all metadata in the PKGBUILD with very little dependencies and considerable portability. Regards, Xyne
On Sat 09 Jan 2010 21:23 +0100, Xyne wrote:
You haven't fully understood how it works so I hope you don't mind if I try to explain it again.
I first check the PKGBUILD with "/bin/bash -n PKGBUILD". If this command exits without error then the PKGBUILD contains valid syntax, most importantly it does not contain extra closing brackets ("}"). [...]
Wow this is quite clever. It definitely would make the job of parsing much easier. Thanks for the explanation.
Loui Chang wrote:
Wow this is quite clever. It definitely would make the job of parsing much easier. Thanks for the explanation.
:) I intend to flesh out the parser as special cases pop up. As already mentioned, there will be limits to what it can do depending on whether the packager uses command output to set variables, but perhaps Arch could eventually impose a de facto standard for PKGBUILDs using the parser itself as the standard, i.e. if the PKGBUILD metadata gets past the parser, the PKGBUILD itself would be considered invalid. In that case, the parser would support tricks such as [ "$ARCH" == "x86_64" ] && depends=('foo' 'bar') I want to be very clear that I am NOT suggesting that my parser become the standard, only that a parser based on this approach _could_ become one. Also note that this is really a method on its own that just happens to be implemented in Perl in this case. If you look at the code, you will see that it could very quickly be adapted to Python (and thus Django), or PHP, or just about anything.
On 10/01/2010, at 4:23, Xyne <xyne@archlinux.ca> wrote:
It is quite a clever idea. I haven't seen this approach before. I haven't looked at it thoroughly, but it looks like you're simply sourcing the PKGBUILD with some trickery not to execute the code. Why then the need for further parsing? Does `set` produce "raw" bash, e.g. 'source=("https://localhost/$pkgname.tgz")'? It seems like bash should be able to do it itself. If that were the case, the parser would be extremely reliable (definitely more so than mine). There are still some "safety" issues involved, although maybe not for your purposes. One major thing is infinite loops - there's no way to break them. I'm sure this parser will be very useful when such things aren't an issue. Bash simply parses the file and stores the code itself in the "pkgbuild" function, which itself contains other variables and functions (e.g. package_foo, build). Because the code has not been executed, the variables have not been expanded/interpolated and thus still contain things such s "http://example.com/$pkgname-$pkgver.tar", which is why it must still be intepolated by the parser.
It seems I did understand it, I just forgot assignments don't get interpreted. I suppose there's no way to get bash to execute the assignments but not the code? Perhaps filtering the function definitions from set output. I havn't looked at the output of set so, again, I'm shooting in the dark here.
The advantage of this method is that "set" will print out the "pkgbuild" function and its contents in a canonical form, e.g. all assignments to a variable are on a single line, if/then/else statements follow a single format, etc.
That is very handy indeed.
Let me repeat that my method does not run any code in the PKGBUILD. I've tested this by including an infinite loop at the top of the file and it was not executed. I actually believe that this method provides a perfectly safe and potentially very reliable method of retrieving all metadata in the PKGBUILD with very little dependencies and considerable portability.
Indeed. Perhaps Allan would be interested on this for his makepkg test suite, although maybe more in the concept since the test suite us in python.
Regards, Xyne
On Tue, Jan 12, 2010 at 07:52:54AM +0800, Sebastian Nowicki wrote:
Bash simply parses the file and stores the code itself in the "pkgbuild" function, which itself contains other variables and functions (e.g. package_foo, build). Because the code has not been executed, the variables have not been expanded/interpolated and thus still contain things such s "http://example.com/$pkgname-$pkgver.tar", which is why it must still be intepolated by the parser.
It seems I did understand it, I just forgot assignments don't get interpreted. I suppose there's no way to get bash to execute the assignments but not the code? Perhaps filtering the function definitions from set output. I havn't looked at the output of set so, again, I'm shooting in the dark here.
If you're willing to trust the variable declaration part of the PKGBUILD, then yeah it'd be easy to execute just that part. You don't even need to cut out the build() function, since executing the whole thing would only declare and not run that function. All you'd need to do is to add some "echo"s at the end of the wrapper function you've constructed, and execute the wrapper function. But this would essentially be the same as making a temp copy of the PKGBUILD, and adding some "echo"s at the end of the file, then sourcing the file. Xyne doesn't want to trust the PKGBUILDs that far, he's only using bash to format the code canonically, then he's regexping the canonical form instead of executing any of it. -- Jim Pryor profjim@jimpryor.net
If you're willing to trust the variable declaration part of the PKGBUILD, then yeah it'd be easy to execute just that part. You don't even need to cut out the build() function, since executing the whole thing would only declare and not run that function. All you'd need to do is to add some "echo"s at the end of the wrapper function you've constructed, and execute the wrapper function.
That doesnt work for overridden variables in split packages because they are set inside the packaging function(s). Anything which could selectively execute blocks of code inside of functions to get the values of the variables would be far more complicated than this approach and probably far more exploitable. Even without that to consider, you cannot blindly trust the variable declaration section of PKGBUILDs uploaded to the AUR.
On Tue, Jan 12, 2010 at 02:29:35PM +0100, Xyne wrote:
That doesnt work for overridden variables in split packages because they are set inside the packaging function(s).
Yes, right, good point. That answers a question I asked in another message.
Even without that to consider, you cannot blindly trust the variable declaration section of PKGBUILDs uploaded to the AUR.
Yes, exactly, that's why I was thinking of exploits your method might still be vulnerable to unless you take special steps to catch them. -- Jim Pryor profjim@jimpryor.net
Indeed. Perhaps Allan would be interested on this for his makepkg test suite, although maybe more in the concept since the test suite us in python.
It would be trivial to conver this to Python. I will probably do that myself if there seems to be enough interest in it.
On Sat, Jan 09, 2010 at 09:23:56PM +0100, Xyne wrote:
I first check the PKGBUILD with "/bin/bash -n PKGBUILD". If this command exits without error then the PKGBUILD contains valid syntax, most importantly it does not contain extra closing brackets ("}").
This lets me wrap the entire PKGBUILD in a function, e.g. pkgbuild () { <PKGBUILD> }
I can then source the file with Bash without executing any code. The previous check with "bash -n" guarantees that the PKGBUILD can not escape the wrapping function. Because all code is inside a function, sourcing the file does not execute any code at all.
Bash simply parses the file and stores the code itself in the "pkgbuild" function, which itself contains other variables and functions (e.g. package_foo, build). Because the code has not been executed, the variables have not been expanded/interpolated and thus still contain things such s "http://example.com/$pkgname-$pkgver.tar", which is why it must still be intepolated by the parser.
I was brainstorming to think of possible exploits. It looks like this is valid syntax: echo normal stuff exit 0 any funky stuff I want pkgver=#$#%$%%^&^$@#$$@^ } more funky stuff { Running bash -n on that gives 0. Now there's not necessarily anything wrong here---unless your parser doesn't stop parsing at the exit command. If it goes past that, then maybe exploits could be introduced, because we wouldn't be entitled to the assumption that the rest of the code is valid syntax. -- Jim Pryor profjim@jimpryor.net
I was brainstorming to think of possible exploits. It looks like this is valid syntax:
echo normal stuff exit 0 any funky stuff I want pkgver=#$#%$%%^&^$@#$$@^ } more funky stuff {
Running bash -n on that gives 0. Now there's not necessarily anything wrong here---unless your parser doesn't stop parsing at the exit command. If it goes past that, then maybe exploits could be introduced, because we wouldn't be entitled to the assumption that the rest of the code is valid syntax.
-- Jim Pryor
I haven't tested that but I don't think it would be an issue. As long as it doesn't break out of the function declaration, it shoulld work and afaik, you can include "exit" inside a function. I'm not a Bash expert though, so correct me if I'm wrong.
On Tue, Jan 12, 2010 at 02:20:36PM +0100, Xyne wrote:
I was brainstorming to think of possible exploits. It looks like this is valid syntax:
echo normal stuff exit 0 any funky stuff I want pkgver=#$#%$%%^&^$@#$$@^ } more funky stuff {
Running bash -n on that gives 0. Now there's not necessarily anything wrong here---unless your parser doesn't stop parsing at the exit command. If it goes past that, then maybe exploits could be introduced, because we wouldn't be entitled to the assumption that the rest of the code is valid syntax.
I haven't tested that but I don't think it would be an issue. As long as it doesn't break out of the function declaration, it shoulld work and afaik, you can include "exit" inside a function. I'm not a Bash expert though, so correct me if I'm wrong.
Yes, there's not necessarily a problem here. The wrapper function you construct would be syntactically valid and would execute fine. (It would exit whatever shell it's executed in.) I mention it to point out that you have to be careful regexping through the source (even the Bash-canonically-formatted source) of a function that Bash says is syntactically valid. There may be parts of that source that you assume to be valid Bash but which aren't, because they come after an exit. There may be other ways to do this too, e.g., with an exec. E.g.: TESTFILE=$(cat <<"EOF" echo normal stuff exit 0 any funky stuff I want pkgver=#$#%$%%^&^$@#$$@^ } more funky stuff { EOF ) WRAPPED=$'wrapper() {\n'"$TESTFILE"$'\n}' bash -n <<< "$WRAPPED" echo $? # prints 0 bash <<< "$WRAPPED"$'\ntype wrapper' prints: ======= wrapper is a function wrapper () { echo normal stuff; exit 0; any funky stuff I want; pkgver=#$#%$%%^ & ^$@#$$@^ } more funky stuff { } The garbage comes out in the output of `set` too, as well as `type wrapper`. Now if you try regexping that, without truly parsing it, you may if you're not careful find yourself processing the garbage at the end. That may make exploits possible. I like the approach you're using, but this loophole occurred to me and I thought you should know about it. You'll have to decide whether it's worth protecting against. But that makes salient the question of what advantage you're aiming to get by what you're doing rather than just checking the PKGBUILD with bash -n and if that works, sourcing it with some appended echos to print out the variables we're interested in. One possible advantage would be to protect against malicious stuff. If that's your aim, then you should be protecting against exploits like this may make possible. -- Jim Pryor profjim@jimpryor.net
Yes, there's not necessarily a problem here. The wrapper function you construct would be syntactically valid and would execute fine. (It would exit whatever shell it's executed in.)
I mention it to point out that you have to be careful regexping through the source (even the Bash-canonically-formatted source) of a function that Bash says is syntactically valid. There may be parts of that source that you assume to be valid Bash but which aren't, because they come after an exit. There may be other ways to do this too, e.g., with an exec.
The wrapper function is never executed so I don't see how this is an issue. The only possible issue would be that the parser would miss the exit and set variables that are set after it, but that's not really an issue either. If someone creates such a PKGBUILD, the PKGBUILD itself is invalid so it doesn't make any different if the parser assigns variables before or after the exit. This is basically the same as pointing out that if the PKGBUILD specifies an incorrect depends array then the parser will also specify an incorrect depends array. The point of the parser is to grab metadata from a PKGBUILD, not magically determine if the PKGBUILD itself is correct in every possible sense of the word. If the packager provides incorrect or nonsensical data then there is little than can be done. It's not an exploit though, because no malicious code is executed.
E.g.:
TESTFILE=$(cat <<"EOF" echo normal stuff exit 0 any funky stuff I want pkgver=#$#%$%%^&^$@#$$@^ } more funky stuff { EOF ) WRAPPED=$'wrapper() {\n'"$TESTFILE"$'\n}' bash -n <<< "$WRAPPED" echo $? # prints 0 bash <<< "$WRAPPED"$'\ntype wrapper'
prints: ======= wrapper is a function wrapper () { echo normal stuff; exit 0; any funky stuff I want; pkgver=#$#%$%%^ & ^$@#$$@^ } more funky stuff { }
The garbage comes out in the output of `set` too, as well as `type wrapper`.
Now if you try regexping that, without truly parsing it, you may if you're not careful find yourself processing the garbage at the end. That may make exploits possible.
I like the approach you're using, but this loophole occurred to me and I thought you should know about it. You'll have to decide whether it's worth protecting against.
Can you an example of an exploit using this method? I really can't think of any. If you are suggesting that the trailing "{" would throw off the parser then you are wrong. The output of the set, as I've stressed previously, is canonical. This permits the unequivocal parsing of function blocks because they are opened and closed on their own lines and thus code inside the block cannot both pass the validity check and appear on its own line.
But that makes salient the question of what advantage you're aiming to get by what you're doing rather than just checking the PKGBUILD with bash -n and if that works, sourcing it with some appended echos to print out the variables we're interested in. How can you source the packaging functions to get the variables nested in the packaging functions without executing the rest of the code inside those functions? How can you source potentially malicious PKGBUILDs safely?
One possible advantage would be to protect against malicious stuff. If that's your aim, then you should be protecting against exploits like this may make possible.
The only possible vulnerabilty of this method will be in the implementation of a function whilelist (e.g. uname -r). So far I see no other possible exploits which would allow someone to get the parser to run malicious code. Regards, Xyne
On Thu, Jan 14, 2010 at 12:28:32AM +0100, Xyne wrote:
The wrapper function is never executed so I don't see how this is an issue.
I know it's not executed. Actually it's being executed wouldn't be any problem, so far as this issue goes, because the shell would hit the exit.
The only possible issue would be that the parser would miss the exit and set variables that are set after it
Yes exactly.
but that's not really an issue either. If someone creates such a PKGBUILD, the PKGBUILD itself is invalid so it doesn't make any different if the parser assigns variables before or after the exit.
It depends on what your code does with the variables (now or in the future). I agree that nothing so far on that table must break here. But one might be _tempted_ to think additionally that the variables we extract with this method will only contain valid bash syntax. I'm saying that's not true (unless we took special steps to guard against these tricks). If one didn't see that it's not true, one might later try to evaluate some parts of those variables---e.g. what looks like a "$(uname ...)". And then exploits would threaten. But if all you'll ever be doing is getting Bash to format the function, and then thereafter *only* ever treating what you've got as text, never as code, yeah you're ok. -- Jim Pryor profjim@jimpryor.net
participants (5)
-
Jim Pryor
-
Laszlo Papp
-
Loui Chang
-
Sebastian Nowicki
-
Xyne