[aur-dev] Safe and relatively reliable PKGBUILD parser.

Xyne xyne at archlinux.ca
Sat Jan 9 15:23:56 EST 2010


> It is quite a clever idea. I haven't seen this approach before. I  
> haven't looked at it thoroughly, but it looks like you're simply  
> sourcing the PKGBUILD with some trickery not to execute the code. Why  
> then the need for further parsing? Does `set` produce "raw" bash, e.g.  
> 'source=("https://localhost/$pkgname.tgz")'? It seems like bash should  
> be able to do it itself. If that were the case, the parser would be  
> extremely reliable (definitely more so than mine). There are still  
> some "safety" issues involved, although maybe not for your purposes.  
> One major thing is infinite loops - there's no way to break them. I'm  
> sure this parser will be very useful when such things aren't an issue.

You haven't fully understood how it works so I hope you don't mind if I
try to explain it again.

I first check the PKGBUILD with "/bin/bash -n PKGBUILD". If this
command exits without error then the PKGBUILD is syntactically valid.
Most importantly, it does not contain any unmatched closing braces ("}").
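
As a rough sketch of that check: the helper name check_syntax and the
two sample files below are made up for illustration, and are not taken
from the parser itself. It only shows that "bash -n" accepts a
well-formed file and rejects one with a stray "}".

```shell
# Hypothetical helper: succeeds if the file parses as valid Bash.
# "bash -n" reads the commands without executing them.
check_syntax() {
    bash -n "$1" 2>/dev/null
}

tmpdir=$(mktemp -d)

# A well-formed sample PKGBUILD (contents invented for illustration).
printf '%s\n' 'pkgname=foo' 'build() {' '  make' '}' > "$tmpdir/good"

# A stray closing brace, which would otherwise end a wrapping function early.
printf '%s\n' 'pkgname=foo' '}' 'echo escaped' > "$tmpdir/bad"

check_syntax "$tmpdir/good" && good_result=ok || good_result=rejected
check_syntax "$tmpdir/bad" && bad_result=ok || bad_result=rejected

echo "good: $good_result"
echo "bad: $bad_result"
rm -rf "$tmpdir"
```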

This lets me wrap the entire PKGBUILD in a function, e.g.
pkgbuild () {
<PKGBUILD>
}

I can then source the wrapped file with Bash. The previous check with
"bash -n" guarantees that the PKGBUILD cannot escape the wrapping
function, and because all of its code is inside that function, sourcing
the file only defines the function and does not execute any of the code.
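
A minimal sketch of the wrap-and-source step, assuming a throwaway
sample PKGBUILD and temporary file names that are purely illustrative.
After sourcing, the function exists but the assignment inside it has
not taken effect.

```shell
tmpdir=$(mktemp -d)

# Sample PKGBUILD (invented for illustration).
cat > "$tmpdir/PKGBUILD" <<'EOF'
pkgname=foo
build() {
  make
}
EOF

# Refuse to continue unless the syntax check passes.
bash -n "$tmpdir/PKGBUILD" || exit 1

# Wrap the whole file in a function named "pkgbuild".
{
    echo 'pkgbuild () {'
    cat "$tmpdir/PKGBUILD"
    echo '}'
} > "$tmpdir/wrapped"

# Source the wrapped file in a fresh Bash and report what it left behind.
result=$(bash -c '
    source "$1"
    declare -F pkgbuild >/dev/null && echo "function defined"
    [ -z "${pkgname+x}" ] && echo "pkgname unset"
' _ "$tmpdir/wrapped")

echo "$result"
rm -rf "$tmpdir"
```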

Bash simply parses the file and stores the code itself in the
"pkgbuild" function, whose body contains the PKGBUILD's variable
assignments and function definitions (e.g. package_foo, build). Because
the code has not been executed, the variables have not been
expanded/interpolated and thus still contain things such as
"http://example.com/$pkgname-$pkgver.tar", which is why they must still
be interpolated by the parser.

The advantage of this method is that "set" will print out the
"pkgbuild" function and its contents in a canonical form, e.g. all
assignments to a variable are on a single line, if/then/else statements
follow a single format, etc.
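
To illustrate the canonical form, here is a small sketch that feeds two
differently formatted versions of the same (invented) PKGBUILD through
the wrapping step and compares what Bash prints back. Note that it uses
"declare -f pkgbuild" rather than plain "set", simply to print that one
function on its own; the helper name canonical is hypothetical.

```shell
tmpdir=$(mktemp -d)

# Two sample PKGBUILDs (invented) with the same content but different layout.
cat > "$tmpdir/a" <<'EOF'
pkgname=foo; pkgver=1.0
depends=('glibc' 'zlib')
if [ -n "$pkgver" ]; then source="http://example.com/$pkgname-$pkgver.tar"; fi
EOF

cat > "$tmpdir/b" <<'EOF'
pkgname=foo
pkgver=1.0
depends=('glibc'
         'zlib')
if [ -n "$pkgver" ]
then
    source="http://example.com/$pkgname-$pkgver.tar"
fi
EOF

# Hypothetical helper: wrap a file, source it and print the stored function.
canonical() {
    { echo 'pkgbuild () {'; cat "$1"; echo '}'; } > "$1.wrapped"
    bash -c 'source "$1"; declare -f pkgbuild' _ "$1.wrapped"
}

form_a=$(canonical "$tmpdir/a")
form_b=$(canonical "$tmpdir/b")

# Show the canonical text once, then compare the two.
printf '%s\n' "$form_a"
[ "$form_a" = "$form_b" ] && same=yes || same=no
echo "identical canonical form: $same"
rm -rf "$tmpdir"
```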

This makes it possible to easily parse the assignments themselves, in
the order that they occur, without having to consider all variations of
valid whitespace in statements. The parser simply needs to recognize
Bash syntax for things such as string substitutions, but this is a
relatively limited set so it is not difficult to handle all such cases.
The output of "set" also guarantees that you have a representation of
all variable assignments (in sequential order, and within their local
environment) so you have all the information that you need to
interpolate them. You could even handle command output if you wish,
using a command white-list to make sure that no trickery is used to run
malicious code.
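
A very reduced sketch of the interpolation idea follows. It is not the
actual parser: it only handles scalar assignments and the plain "$name"
and "${name}" forms, leaves everything else untouched, and the helper
names (lookup, interpolate) and the sample PKGBUILD are assumptions made
for the example. No PKGBUILD text is ever passed to eval.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'pkgname=foo' 'pkgver=1.0' \
    'source="http://example.com/$pkgname-$pkgver.tar"' > "$tmpdir/PKGBUILD"
bash -n "$tmpdir/PKGBUILD" || exit 1
{ echo 'pkgbuild () {'; cat "$tmpdir/PKGBUILD"; echo '}'; } > "$tmpdir/wrapped"
bash -c 'source "$1"; declare -f pkgbuild' _ "$tmpdir/wrapped" > "$tmpdir/canonical"

# Known variables as "name=value" lines, newest first, so that the first
# match is the most recent assignment.
vars=

# Print the value recorded for variable $1 (nothing if it is unknown).
lookup() {
    printf '%s\n' "$vars" | while IFS= read -r entry; do
        case $entry in
            "$1="*) printf '%s' "${entry#*=}"; break ;;
        esac
    done
}

# Expand $name and ${name} in $1 using the recorded values. Anything else
# after a "$" (e.g. "$(cmd)" or "${name%x}") is left as it is.
interpolate() {
    s=$1 out=
    while :; do
        case $s in
            *\$*) ;;
            *) break ;;
        esac
        out=$out${s%%\$*}           # text before the first "$"
        s=${s#*\$}                  # text after the first "$"
        case $s in
            \{*\}*) name=${s#\{}; name=${name%%\}*}; rest=${s#*\}} ;;
            *)      name=${s%%[!A-Za-z0-9_]*}; rest=${s#"$name"} ;;
        esac
        case $name in
            ''|[0-9]*|*[!A-Za-z0-9_]*) out=$out\$ ;;   # not a plain name
            *) out=$out$(lookup "$name"); s=$rest ;;
        esac
    done
    printf '%s' "$out$s"
}

# Walk the canonical text line by line, in order of assignment.
while read -r line; do
    line=${line%;}
    case $line in
        [A-Za-z_]*=*) ;;
        *) continue ;;
    esac
    key=${line%%=*}
    val=${line#*=}
    case $key in *[!A-Za-z0-9_]*) continue ;; esac
    case $val in
        \"*\") val=${val#\"}; val=${val%\"}; val=$(interpolate "$val") ;;
        \'*\') val=${val#\'}; val=${val%\'} ;;      # single quotes: literal
        \(*)   continue ;;                          # arrays not handled here
        *)     val=$(interpolate "$val") ;;
    esac
    vars="$key=$val
$vars"
    [ "$key" = source ] && resolved_source=$val
    echo "$key=$val"
done < "$tmpdir/canonical"
rm -rf "$tmpdir"
```

A real parser would also need arrays, the other parameter expansion
forms and variables local to functions; the point here is only that the
assignments can be resolved in order from the canonical text.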

Let me repeat that my method does not run any code in the
PKGBUILD. I've tested this by including an infinite loop at the top of
the file and it was not executed. I actually believe that this approach
provides a perfectly safe and potentially very reliable way of
retrieving all metadata in the PKGBUILD with very few dependencies
and considerable portability.
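
The infinite-loop test can be reproduced along these lines. The sample
file and the marker file name are invented for the example; the marker
would be created if the loop body ever ran.

```shell
tmpdir=$(mktemp -d)

# Sample PKGBUILD that would never return if it were executed.
cat > "$tmpdir/PKGBUILD" <<EOF
while true; do touch "$tmpdir/executed"; done
pkgname=foo
EOF

bash -n "$tmpdir/PKGBUILD" || exit 1
{ echo 'pkgbuild () {'; cat "$tmpdir/PKGBUILD"; echo '}'; } > "$tmpdir/wrapped"

# Sourcing returns straight away: the loop is stored in the function, not run.
bash -c 'source "$1"' _ "$tmpdir/wrapped" && sourced=yes || sourced=no
[ -e "$tmpdir/executed" ] && loop_ran=yes || loop_ran=no

echo "sourced: $sourced, loop ran: $loop_ran"
rm -rf "$tmpdir"
```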


Regards,
Xyne

