[pacman-dev] [RFC] Package parser in python

Sebastian Nowicki sebnow at gmail.com
Sun Dec 13 02:59:53 EST 2009

On 12/12/2009, at 10:11 PM, Allan McRae wrote:

> Sebastian Nowicki wrote:
>> As you may have heard, I started a proper PKGBUILD parser[1], which  
>> parses according to shell semantics and does a little interpreting.  
>> I just released the first version, which doesn't handle errors, or  
>> multi-line values (like arrays or escaped newlines) very well. It  
>> does however support split packages. I'm in the process of  
>> modifying parched to essentially turn it into python bindings[2]  
>> for pkgparse.
>> You probably already have a parser at this point, so I'm not sure  
>> how useful this would be to you (it might be overkill anyway), I  
>> just thought I'd let you know.
>> [1]: http://github.com/sebnow/pkgparse
>> [2]: http://github.com/sebnow/parched/tree/pkgparse_pyrex
> Looks interesting.  I will take it for a spin later. I assume this  
> is going towards AUR2?
> I had not done any further work on my parser as I was uncertain what  
> was the best way to go in developing a makepkg test suite.  Given  
> the makepkg test suite will use a safe set of PKGBUILDs, I was  
> thinking of just using bash to parse them.

That would probably be the simplest and most accurate way (you don't  
want bugs in the parser to fail tests). The namcap method is a good  
way to go. You could look at the current AUR2 implementation to get  
information (it sources the PKGBUILD and spits out python code)[1].

It is indeed for AUR2. I thought that since you were contemplating a  
PKGBUILD parser, this might be of use to you.

On 12/12/2009, at 11:44 PM, Xavier wrote:

> I can't help but think this whole situation is stupid.
> I would suppose that PKGBUILDs were written in bash for simplicity
> reason : makepkg just needs to source them, and that's it. Whole
> parsing done for free.
> And now we realize that when using untrusted source, we cannot do that
> anymore. And now we basically have to rewrite a bash parser from
> scratch. I mean, it's hard to imagine a more flawed design, and more
> complex solution to a simple problem.

It is, but as Allan pointed out, makepkg's goals are different. AUR  
effectively uses PKGBUILDs in an unconventional way.

> Somehow we manage to go from a very KISS solution to a completely  
> anti-KISS one.
> I only see two solutions :
> - we keep using bash, but try to do that in the most restricted
> environment possible (e.g. namcap way , or maybe there is something
> even more restrictive and secure ?)

I don't think it's possible to make the bash environment any more
restrictive than namcap's way. There are fundamental things about bash
that make it insecure for blind parsing. The first example that comes
to mind is infinite loops. They may not be deliberate, or occur
often, but they are a possibility. Others involve the various utilities
that are required for a sane build environment (mv, install, cat,
etc.) but can be dangerous when used maliciously. I recall
Callan (or someone else) provided a nice example of this on the
aur-dev mailing list. It's probably lost now.
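
To illustrate the infinite-loop point, here is a minimal Python sketch
that sources a PKGBUILD in a child bash process under a time limit. The
helper name source_field and the 5 second default are assumptions made
for this example; this is not namcap's or AUR2's actual code. The
timeout only bounds a runaway loop. The file is still executed, so the
concern about utilities like mv or install is untouched.

```python
import subprocess


def source_field(pkgbuild_path, field, timeout=5.0):
    """Source a PKGBUILD in a child bash and return one scalar variable.

    Hypothetical helper: it still runs whatever the file contains, the
    timeout only limits for how long. Pass an absolute path, since
    `source` searches PATH for names without a slash.
    """
    # "$1" is the file to source and "$2" the variable name; ${!2} is
    # bash indirect expansion (the value of the variable named by $2).
    script = 'source "$1" >/dev/null 2>&1; printf "%s" "${!2}"'
    result = subprocess.run(
        ["bash", "--noprofile", "--norc", "-c", script,
         "bash", pkgbuild_path, field],
        capture_output=True,
        text=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired on a runaway loop
    )
    return result.stdout
```

A caller would catch subprocess.TimeoutExpired and treat the PKGBUILD
as unparseable.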

A custom parser avoids these problems by not executing commands (at
the cost of possibly extracting inaccurate metadata, e.g. when a value
is generated with sed) and by restricting language constructs like
loops (e.g. a maximum number of iterations before an error is raised).
There are of course pros and cons to both approaches.
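
As a rough illustration of the "parse, don't execute" idea, here is a
minimal Python sketch that collects simple single-line assignments and
refuses command substitution instead of running it. The names
parse_assignments and UnsupportedConstruct are hypothetical, and this is
not how pkgparse is implemented. It expands no variables and does not
handle multi-line values.

```python
import shlex


class UnsupportedConstruct(Exception):
    """Raised when a value would require executing shell code."""


def parse_assignments(text):
    """Collect name=value and name=(a b c) assignments without running anything."""
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if not name.isidentifier():
            # Not a plain assignment, e.g. "./configure --prefix=/usr".
            continue
        if "$(" in value or "`" in value:
            # Command substitution is reported, never executed.
            raise UnsupportedConstruct(
                f"{name}: command substitution is not evaluated")
        if value.startswith("(") and value.endswith(")"):
            # Single-line array: split on shell quoting rules.
            fields[name] = shlex.split(value[1:-1])
        else:
            parts = shlex.split(value)
            fields[name] = parts[0] if parts else ""
    return fields
```

The trade-off mentioned above is visible here: a pkgver computed with
sed or date is rejected rather than guessed.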

>> It is not particularly possible given all the bash "tricks" used in
> Why is that ? Things like calling external commands and such ?
> And would the alternative parser support these bash tricks anyway ?
> So why don't we just forbid / not support them ?

Some "tricks" are required to keep PKGBUILDs sane and free of
duplication. One particular trick that comes to mind is: if [ "$CARCH"
= "x86_64" ]; then ... fi. This is very common with
binary sources, and the PKGBUILD format has no native support for it.
Removing support for it would require two almost identical PKGBUILDs.
Because of things like this, a bit of interpretation is required for the
parser to work properly. I plan on supporting conditionals like these
in pkgparse.
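
The kind of "little interpreting" this needs can be sketched in Python
as follows: a string comparison inside [ ... ] is evaluated against a
variable table supplied by the caller, without running bash. The
function name eval_test is hypothetical, and only =, == and != on two
words are handled. This is an illustration, not pkgparse's code.

```python
import shlex


def eval_test(expr, variables):
    """Evaluate a test such as [ "$CARCH" = "x86_64" ] without a shell."""
    tokens = shlex.split(expr)
    if len(tokens) != 5 or tokens[0] != "[" or tokens[4] != "]":
        raise ValueError(f"unsupported test expression: {expr!r}")
    _, left, op, right, _ = tokens
    if op not in ("=", "==", "!="):
        raise ValueError(f"unsupported operator: {op!r}")

    def expand(word):
        # Simplification: shlex drops quoting, so a single-quoted '$X'
        # is expanded here even though bash would leave it literal.
        if word.startswith("${") and word.endswith("}"):
            return variables.get(word[2:-1], "")
        if word.startswith("$"):
            return variables.get(word[1:], "")
        return word

    equal = expand(left) == expand(right)
    return not equal if op == "!=" else equal
```

A parser could call this once per supported architecture, e.g. with
{"CARCH": "i686"} and {"CARCH": "x86_64"}, to collect the metadata for
each.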

I have also seen for loops populating fields (I think in the kernel
package). Tricks like these should not be supported, as they are merely
due to the laziness of the maintainer.

Essentially, for building packages, or extracting metadata from safe  
packages, using bash directly is the way to go. For environments where  
arbitrary packages are parsed and security is an issue, a parser is  
currently required.

> - we decide that pkgbuild format is a flawed design, and was too
> limited for our needs, and switch to a new one (in which case Xyne's
> brainstorming could help : http://xyne.archlinux.ca/ideas/pkgmeta )

I have been thinking of designing a new package metadata format  
myself, one that is more universal (i.e. not specific to pacman). I  
doubt this is achievable, but it's an interesting experiment. The idea  
is mostly inspired by the rockspec format for luarocks[2]. It seems  
very flexible and provides lots of data. The Lua syntax resembles  
JSON, so that format can be used instead. YAML could also be used to  
make it "prettier" and even more flexible. Anyway that's a discussion  
for another thread. A new format is unlikely to be used by Archlinux  
any time soon.

Sorry for going a little off-topic there.

[1]: http://github.com/sebnow/aur2/blob/1ad21387a58eab0d3cf4296f08b6edee0a1e27d2/archlinux/aur/Package/parsepkgbuild.sh
[2]: http://www.luarocks.org/en/Creating_a_rock
