[arch-general] PKGBUILD parser
Well, my second rewrite seems to have come to a dead end as well, and before i rewrite this again, i was hoping someone here could give me tips on what would be the best method to parse a PKGBUILD file?
you can play with my latest fail here: http://osku.de/dump/pkgbuild.js/test-pkgbuild.html
@todo parseBuild() fails if you have } in the body
@todo some array content in PKGBUILDs mixes ' and ", so this fails ATM...
(and the test-pkgbuild.html somehow doesn't parse build() on every second run...)
cheers .andre
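For the second @todo (array content that mixes ' and "), one option is to hand the quoting to a shell-style tokenizer rather than a hand-written regex. The following is only a rough sketch in Python, not the pkgbuild.js code: the helper name parse_array_line is made up for illustration, it only handles arrays written on a single line, and it does not expand variables such as $pkgver.

```python
import re
import shlex

# Matches a one-line bash array assignment such as: depends=('a' "b" c)
ARRAY_RE = re.compile(r"^\s*(\w+)=\((.*)\)\s*$")


def parse_array_line(line):
    """Return (name, items) for a one-line array assignment, else None.

    Hypothetical helper: multi-line arrays, trailing comments and
    variable expansion are deliberately not handled.
    """
    match = ARRAY_RE.match(line)
    if match is None:
        return None
    name, body = match.groups()
    # shlex.split follows POSIX shell quoting rules, so 'x', "x" and x
    # all produce the same token regardless of how they are mixed.
    return name, shlex.split(body)


print(parse_array_line("""depends=('glibc' "gtk2>=2.10" bash)"""))
```

Because the tokenizer treats both quote styles the same way, the mixed-quote case needs no special handling in the caller.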
On 05/09/2010 05:53 AM, Andre "Osku" Schmidt wrote:
Well,
my second rewrite seems also come to a dead-end, and before i rewrite this again, i was hoping someone here could give me tips on what would be the best method to parse a PKGBUILD file ?
you can play with my latest fail here: http://osku.de/dump/pkgbuild.js/test-pkgbuild.html
@todo parseBuild() fails if you have } in body @todo some array content in PKGBUILD mix ' and ", so this fails ATM... (and the test-pkgbuild.html somehow doesn't parse build() on every second run...)
cheers .andre
Is this something that's strictly intended for the web? Because you could just source it, and that would only leave you with the comments to inspect.
On 09/05/10 22:35, Matthew Monaco wrote:
On 05/09/2010 05:53 AM, Andre "Osku" Schmidt wrote:
Well,
my second rewrite seems also come to a dead-end, and before i rewrite this again, i was hoping someone here could give me tips on what would be the best method to parse a PKGBUILD file ?
you can play with my latest fail here: http://osku.de/dump/pkgbuild.js/test-pkgbuild.html
@todo parseBuild() fails if you have } in body @todo some array content in PKGBUILD mix ' and ", so this fails ATM... (and the test-pkgbuild.html somehow doesn't parse build() on every second run...)
cheers .andre
Is this something that's strictly intended for the web? Because you could just source it, and that would only leave you with the comments to inspect.
Sourcing is dangerous if the PKGBUILD is from an untrusted source. It also fails with package splitting... Allan
On Sun, May 9, 2010 at 2:44 PM, Allan McRae <allan@archlinux.org> wrote:
Sourcing is dangerous if the PKGBUILD is from an untrusted source. It also fails with package splitting...
Makes me wonder why pkgbuilds are written in bash. Sounds like a big design flaw.
But it depends on what our needs are:
1) we don't care about untrusted sources or security, we always trust the source; then bash sourcing is very convenient (the original idea behind that design)
2) we care about security and dealing with untrusted sources in a secure way: the existing format sucks
Currently we are neither in 1) nor in 2); we are somewhere in the middle, with the inconveniences of both sides. We lost the convenience of 1), bash sourcing, with package splitting. (I've been meaning to fix this for one year or so, just never got to it.)
And we don't have any ideas about how we could ever suit 2). Changing the pkgbuild format doesn't sound really doable or realistic; it might be the most important characteristic of what Arch is, and changing it would make a new distrib. But I just had an idea now, if we're thinking about the AUR use case: makepkg --source could generate a suitable and parsable file providing all the information that AUR needs, and ship that next to the PKGBUILD in the source tarball. Does that sound crazy? This would not fix the problem now, but it could fix it eventually, when most pkgbuilds are re-submitted. Or this parsable file could be generated for all pkgbuilds in a row, just for the conversion, in a chroot/jail on a machine not in production.
To reiterate: the PKGBUILD format was meant to be easy to parse by using bash source. The moment you stop using bash source, it's just all wrong, and it's the format you have to change.
Just to let you know dude, you can't parse that with a regular expression. A regular expression is modeled / parsed by a finite automaton, i.e. a state machine with a finite number of states. Braces allow nesting, which needs a potentially unbounded number of states to track. Consider:
a() { echo 1; b() { echo 2; }; }
Potentially I could nest expressions like that endlessly. A regular expression will never be able to parse that, because it can never decide which brace is the final one. This might be better explained here:
http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to...
Kaiting.
On Sun, May 9, 2010 at 10:21 AM, Xavier Chantry <chantry.xavier@gmail.com> wrote:
On Sun, May 9, 2010 at 2:44 PM, Allan McRae <allan@archlinux.org> wrote:
Sourcing is dangerous if the PKGBUILD is from an untrusted source. It also fails with package splitting...
Makes me wonder why pkgbuilds are written in bash. Sounds like a big design flaw.
But it depends on what our needs are : 1) we don't care about untrusted source or security, we always trust the source, then bash sourcing is very convenient (original idea behind that design) 2) we care about security and dealing with untrusted source in a secure way : the existing format sucks
Currently we are neither in 1), nor in 2), we are somewhere in the middle with the inconvenient of both sides. We lost the convenience of 1) bash sourcing with package splitting. (I've been meaning to fix this for one year or so, just never got to it).
And we don't have any ideas about how we could ever suit 2). Changing pkgbuild format doesn't sound really doable and realistic, it might be the most important characterization of what Arch is, changing it would make a new distrib. But I just had an idea now, if we're thinking about AUR use case : makepkg --source could generate a suitable and parsable file providing all information that AUR needs, and ships that next to the PKGBUILD in the source tarball. Does that sound crazy ? This would not fix the problem now, but it could fix it eventually, when most pkgbuilds are re-submitted. Or this parsable file could be generated for all pkgbuilds in a row, just for the conversion, in a chroot/jail on a machine not in production.
To re-iterate : PKGBUILD format was meant to be easy to parse by using bash source. The moment you stop using bash source, it's just all wrong, and it's the format you have to change.
-- Kiwis and Limes: http://kaitocracy.blogspot.com/
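The nesting argument above suggests the usual alternative to a plain regular expression: find the function header, then walk the text while counting brace depth. Here is a minimal sketch in Python under some loud assumptions: extract_function_body is a made-up name, the header is assumed to look like "name() {" at the start of a line, quotes are tracked only roughly, and comments and here-documents are not handled at all (an apostrophe inside a comment would confuse it).

```python
import re


def extract_function_body(text, name):
    """Return the text between the outer braces of `name() { ... }`, or None.

    Hypothetical helper; a depth counter stands in for the unbounded
    memory that a true regular expression lacks.
    """
    header = re.search(
        r"^[ \t]*" + re.escape(name) + r"\s*\(\)\s*\{", text, re.MULTILINE
    )
    if header is None:
        return None
    start = header.end()
    depth = 1      # we are already inside the opening brace
    quote = None   # the quote character we are currently inside, if any
    i = start
    while i < len(text):
        ch = text[i]
        if quote:
            if ch == "\\" and quote == '"':
                i += 1                 # skip the escaped character
            elif ch == quote:
                quote = None           # closing quote
        elif ch == "\\":
            i += 1                     # skip an escaped character
        elif ch in "'\"":
            quote = ch                 # opening quote
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start:i]   # matching close of the outer brace
        i += 1
    return None                        # unbalanced input


print(extract_function_body("a() { echo 1; b() { echo 2; }; }", "a"))
```

The same loop also copes with a quoted } in the body, which is the case the first @todo in the original post mentions.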
On Sun, May 9, 2010 at 4:57 PM, Kaiting Chen <kaitocracy@gmail.com> wrote:
Just to let you know dude, you can't parse that with a regular expression. A regular expression is modeled / parsed by a finite automaton = a state machine with a finite number of states. Braces allow nesting which creates a source with potentially an infinite number of states consider,
a() { echo 1; b() { echo 2; }; }
Potentially I could next expressions like that endlessly. A regular expression will never be able to parse that.because it can never decide which brace is the final one. This might be better explained here.
http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to...
thank you. i'll continue my journey on parsing this another way.
"Andre "Osku" Schmidt" <andre.osku.schmidt@googlemail.com> a écrit :
A regular expression will never be able to parse that.because it can never decide which brace is the final one. This might be better explained here.
http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to...
thank you. i'll continue my journey on parsing this another way.
Actually if you read the page linked above completely you will notice that it says that you can. Regexps like POSIX that use finite automata can't but PCRE (that are everywhere) can, at least recent versions. That's also why they are slower. -- catwell (from mobile phone)
On Mon, May 10, 2010 at 1:14 PM, Pierre Chapuis <catwell@archlinux.us> wrote:
"Andre "Osku" Schmidt" <andre.osku.schmidt@googlemail.com> a écrit :
A regular expression will never be able to parse that.because it can never decide which brace is the final one. This might be better explained here.
http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to...
thank you. i'll continue my journey on parsing this another way.
Actually if you read the page linked above completely you will notice that it says that you can. Regexps like POSIX that use finite automata can't but PCRE (that are everywhere) can, at least recent versions. That's also why they are slower.
yes via a "recursive" expression. another option would be to use a multipass setup (i haven't looked at the OP's code) to break the problem into smaller chunks instead of trying to do it all in one expression (i.e. use an expression to count the braces/etc. and build another expression dynamically based off the results of the first)
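The multipass idea can be sketched roughly as follows, in Python, whose re module has no recursive patterns. The first pass measures how deep the braces nest; the second builds an ordinary, non-recursive pattern that is just deep enough. All three function names are invented for this illustration, and braces inside quotes or comments are counted like any others, so treat it as a sketch rather than a working PKGBUILD parser.

```python
import re


def max_brace_depth(text):
    """First pass: how deeply are braces nested anywhere in the text?"""
    depth = deepest = 0
    for ch in text:
        if ch == "{":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == "}":
            depth = max(depth - 1, 0)
    return deepest


def nested_brace_pattern(levels):
    """Second pass: build a pattern matching braces nested up to `levels` deep."""
    pattern = r"\{[^{}]*\}"  # innermost level: no braces inside
    for _ in range(levels - 1):
        # Wrap the previous level in one more pair of braces.
        pattern = r"\{[^{}]*(?:" + pattern + r"[^{}]*)*\}"
    return pattern


def find_function(text, name):
    """Return the braced body (braces included) of `name()`, or None."""
    levels = max(max_brace_depth(text), 1)
    regex = re.compile(
        r"^" + re.escape(name) + r"\s*\(\)\s*(" + nested_brace_pattern(levels) + r")",
        re.MULTILINE,
    )
    match = regex.search(text)
    return match.group(1) if match else None


print(find_function("a() { echo 1; b() { echo 2; }; }", "a"))
```

The generated pattern grows with the nesting depth, but for any one input the depth is finite, which is what makes the regular expression sufficient.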
Interesting, I didn't realize that. But then it's not really a 'regular' expression. They should call it a 'limited-context-free' expression...
Kaiting.
On Mon, May 10, 2010 at 2:21 PM, C Anthony Risinger <anthony@extof.me> wrote:
On Mon, May 10, 2010 at 1:14 PM, Pierre Chapuis <catwell@archlinux.us> wrote:
"Andre "Osku" Schmidt" <andre.osku.schmidt@googlemail.com> a écrit :
A regular expression will never be able to parse that.because it can never decide which brace is the final one. This might be better explained here.
http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to...
thank you. i'll continue my journey on parsing this another way.
Actually if you read the page linked above completely you will notice that it says that you can. Regexps like POSIX that use finite automata can't but PCRE (that are everywhere) can, at least recent versions. That's also why they are slower.
yes via a "recursive" expression. another option would be to use a multipass setup (i havent looked at the OP's code) to break the problem into smaller chunks instead of trying to to it all in one expression (i.e. use an expression to count the braces/etc. and build another expression dynamically based off the results of the first)
-- Kiwis and Limes: http://kaitocracy.blogspot.com/
On Mon, May 10, 2010 at 8:14 PM, Pierre Chapuis <catwell@archlinux.us> wrote:
http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to...
Actually if you read the page linked above completely you will notice that it says that you can. Regexps like POSIX that use finite automata can't but PCRE (that are everywhere) can, at least recent versions. That's also why they are slower.
oh, indeed, thanks! i somehow skipped all those and went straight to read about CFG/PEG...
here's what seems to work in php (and i assume in many server side languages):
/^build\s*\(\)\s*(\{((?>[^{}]+)|(?R1))*\})/mx
sadly it (flag x) didn't work in any browser (js) i tested...
cheers .andre
On Sun 09 May 2010 16:21 +0200, Xavier Chantry wrote:
On Sun, May 9, 2010 at 2:44 PM, Allan McRae <allan@archlinux.org> wrote:
Sourcing is dangerous if the PKGBUILD is from an untrusted source. It also fails with package splitting...
But I just had an idea now, if we're thinking about AUR use case : makepkg --source could generate a suitable and parsable file providing all information that AUR needs, and ships that next to the PKGBUILD in the source tarball. Does that sound crazy ? This would not fix the problem now, but it could fix it eventually, when most pkgbuilds are re-submitted. Or this parsable file could be generated for all pkgbuilds in a row, just for the conversion, in a chroot/jail on a machine not in production.
Yeah I've thought about this as well. Source packages could have a similar format as binary packages with a .PKGINFO file to present the metadata in an easily parsable format. You can read some of my incomplete brainstormings here: http://louipc.mine.nu/arch/%5BRFC%5D-PKGINFO-in-srctargz
On Sun, May 9, 2010 at 6:06 PM, Loui Chang <louipc.ist@gmail.com> wrote:
Yeah I've thought about this as well. Source packages could have a similar format as binary packages with a .PKGINFO file to present the metadata in an easily parsable format.
You can read some of my incomplete brainstormings here: http://louipc.mine.nu/arch/%5BRFC%5D-PKGINFO-in-srctargz
Ah, that's great, never heard of that before! A few comments:
- Using the PKGINFO format sounds like a good idea, but I'm not sure why you want to keep the same name. As you noticed yourself, this would cause stupid problems like a possible confusion between source and package tarballs. Better just call it SRCINFO, so pacman will never be confused.
- For split pkgbuilds and arch, well... Maybe it would be simpler to write as many SRCINFO files as there are PKGINFO files/packages, i.e. one for every combination of split name / arch. Maybe these files could all be combined into just one, I am not sure. But I would not care about data duplication; I would rather keep it as dumb and easy to parse as possible.
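To illustrate how "dumb and easy to parse" such a file could be, here is a small sketch in Python. It assumes the proposed SRCINFO would reuse the "key = value" line layout of .PKGINFO, with repeated keys for multi-valued fields and # for comment lines; nothing in this thread fixes that format, and both the parse_info name and the sample contents are made up for the example.

```python
from collections import defaultdict


def parse_info(text):
    """Parse "key = value" lines into a dict of lists (hypothetical helper)."""
    fields = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        key, sep, value = line.partition("=")
        if not sep:
            continue                      # not a "key = value" line
        # Only the first "=" separates key and value, so "baz>=2" survives.
        fields[key.strip()].append(value.strip())
    return dict(fields)


# Hypothetical file contents, assuming a .PKGINFO-like layout.
SAMPLE = """\
# illustrative example only
pkgname = foo
pkgver = 1.0-1
depend = bar
depend = baz>=2
"""

print(parse_info(SAMPLE))
```

No bash is executed at any point, which is the property the AUR use case needs when the file comes from an untrusted uploader.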
Hello, On Sun, May 09, 2010 at 12:06:34PM -0400, Loui Chang wrote:
On Sun 09 May 2010 16:21 +0200, Xavier Chantry wrote:
On Sun, May 9, 2010 at 2:44 PM, Allan McRae <allan@archlinux.org> wrote:
Sourcing is dangerous if the PKGBUILD is from an untrusted source. It also fails with package splitting...
But I just had an idea now, if we're thinking about AUR use case : makepkg --source could generate a suitable and parsable file providing all information that AUR needs, and ships that next to the PKGBUILD in the source tarball. Does that sound crazy ? This would not fix the problem now, but it could fix it eventually, when most pkgbuilds are re-submitted. Or this parsable file could be generated for all pkgbuilds in a row, just for the conversion, in a chroot/jail on a machine not in production.
Yeah I've thought about this as well. Source packages could have a similar format as binary packages with a .PKGINFO file to present the metadata in an easily parsable format.
The idea of a separate file only for parsing metadata is pretty good. The functions are not needed for the metainformation of the src.tar.gz. In pkgman I cut the functions out of the PKGBUILD and source only the remaining variables:
    # get rid of all functions (from first appearing function to EOF) and empty lines in $_BuildScript
    sed -i -e "/[[:space:]]*()[[:space:]]*[^}]/,$ d" -e "/^[[:space:]]*pkgname=/,$ { /^$/d; }" ${__TMPPKGBUILD}
    bash -n ${__TMPPKGBUILD} && source ${__TMPPKGBUILD} && rm --force ${__TMPPKGBUILD} || error "blablabla"
Sure, if one is really malevolent, he can add a var like "_iamevil=$(rm -fr ${HOME})". But this is a common sourcing problem which one has with all scripting languages.
The problem with additional files is the one Debian has. The control.tar.gz in Debian packages contains multiple files, which provide almost no information. So most of these files are useless. But I think this is not intended in Arch.
However, Xyne made a function-based information parser, which I actually didn't understand. It would be nice if Xyne could explain his ideas in more detail and give some hints on how to use it with bash.
You can read some of my incomplete brainstormings here: http://louipc.mine.nu/arch/%5BRFC%5D-PKGINFO-in-srctargz
--
On 10/05/10 02:06, Loui Chang wrote:
On Sun 09 May 2010 16:21 +0200, Xavier Chantry wrote:
On Sun, May 9, 2010 at 2:44 PM, Allan McRae<allan@archlinux.org> wrote:
Sourcing is dangerous if the PKGBUILD is from an untrusted source. It also fails with package splitting...
But I just had an idea now, if we're thinking about AUR use case : makepkg --source could generate a suitable and parsable file providing all information that AUR needs, and ships that next to the PKGBUILD in the source tarball. Does that sound crazy ? This would not fix the problem now, but it could fix it eventually, when most pkgbuilds are re-submitted. Or this parsable file could be generated for all pkgbuilds in a row, just for the conversion, in a chroot/jail on a machine not in production.
Yeah I've thought about this as well. Source packages could have a similar format as binary packages with a .PKGINFO file to present the metadata in an easily parsable format.
You can read some of my incomplete brainstormings here: http://louipc.mine.nu/arch/%5BRFC%5D-PKGINFO-in-srctargz
I am told I like to be really negative anytime this is brought up... it is not deliberate, I just see the barriers to this working. So here we go! I know you have pointed out some problems already and this is related.
makepkg does not actually parse any of the splitpkg overrides until build time. How do we get the packaging variable overrides without actually making the package (and on every architecture)? We would need to extract the needed fields from the package functions somehow. So that brings us back to needing to hack a bash parser in makepkg or to actually require the package building to take place before you can create a source package. And this is not restricted to package splitting...
e.g.
pkgname=foo
...
# depends not needed at make time
# depends=('bar')
...
package() {
  depends=('bar')
}
Welcome to the world of makepkg hacks... And do not think such hacks are not used. The old klibc PKGBUILD generated a provides array in the build function on the basis of a file name only available at the end of the build process.
The joy of PKGBUILDs is that they are so flexible. The problem with PKGBUILDs is that they are so flexible.
Allan
On Mon, May 10, 2010 at 1:23 AM, Allan McRae <allan@archlinux.org> wrote:
On 10/05/10 02:06, Loui Chang wrote:
On Sun 09 May 2010 16:21 +0200, Xavier Chantry wrote:
On Sun, May 9, 2010 at 2:44 PM, Allan McRae<allan@archlinux.org> wrote:
Sourcing is dangerous if the PKGBUILD is from an untrusted source. It also fails with package splitting...
But I just had an idea now, if we're thinking about AUR use case : makepkg --source could generate a suitable and parsable file providing all information that AUR needs, and ships that next to the PKGBUILD in the source tarball. Does that sound crazy ? This would not fix the problem now, but it could fix it eventually, when most pkgbuilds are re-submitted. Or this parsable file could be generated for all pkgbuilds in a row, just for the conversion, in a chroot/jail on a machine not in production.
Yeah I've thought about this as well. Source packages could have a similar format as binary packages with a .PKGINFO file to present the metadata in an easily parsable format.
You can read some of my incomplete brainstormings here: http://louipc.mine.nu/arch/%5BRFC%5D-PKGINFO-in-srctargz
I am told I like to be really negative anytime this is bought up... it is not deliberate, I just see the barriers to this working. So here we go! I know you have pointed out some problems already and this is related.
makepkg does not actually parse any of the splitpkg overrides until build time. How do we get the packaging variable overrides without actually making the package (and on every architecture)? We would need to extract the needed fields from the package functions somehow. So that brings us back to needing to hack a bash parser in makepkg or to actually require the package building to take place before you can create a source package. And this is not restricted to package splitting...
e.g.
pkgname=foo
...
# depends not needed at make time
# depends=('bar')
...
package() {
  depends=('bar')
}
Welcome to the world of makepkg hacks... And do not think such hacks are not used. The old klibc PKGBUILD generated a provides array in the build function on the basis of a file name only available at the end of the build process.
The joy of PKGBUILDs is that they are so flexible. The problem with PKGBUILDs is that they are so flexible.
The biggest problem indeed comes from any variables that are declared inside a function. Well, it's easy, let's just make a rule to forbid it. Any AUR packager who breaks the rule will have its package data messed up in the AUR interface. Too bad for him/her. The klibc package is/was an exception, not the rule, and it wasn't on AUR so less problematic (still problematic for other tools like my python check-packages for integrity check, but well). So the main thing is split variables that need to be moved top-level. Dan, Aaron and I had some proposals / examples how to deal with that. You were included in the few mail exchanges we had but I am not sure if you did receive all of them as you didn't reply directly in that thread, I will forward it to you.
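A rule like "no metadata variables inside functions" could be checked mechanically before a package is accepted. What follows is only a heuristic sketch in Python, not anything makepkg or the AUR does: the function name, the choice of which variables to watch, and the line-by-line brace counting are all assumptions made for the example, and a function written entirely on one line would slip through.

```python
import re

# Illustrative selection of PKGBUILD metadata variables to watch for.
METADATA = ("pkgdesc", "arch", "license", "depends", "optdepends",
            "provides", "conflicts", "replaces")
ASSIGN_RE = re.compile(r"^\s*(%s)\+?=" % "|".join(METADATA))


def assignments_inside_functions(text):
    """Return (line_number, variable) pairs assigned while brace depth > 0."""
    found = []
    depth = 0
    for number, line in enumerate(text.splitlines(), start=1):
        match = ASSIGN_RE.match(line)
        if match and depth > 0:
            found.append((number, match.group(1)))
        # Naive depth tracking: ${var} expansions are balanced within a
        # line, so they cancel out; quoted lone braces are not handled.
        depth = max(depth + line.count("{") - line.count("}"), 0)
    return found


EXAMPLE = """\
pkgname=foo
# depends=('bar')
package() {
  depends=('bar')
}
"""

print(assignments_inside_functions(EXAMPLE))
```

Run on the package() example quoted earlier in the thread, it reports the depends assignment on line 4 and ignores the commented-out one at top level.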
my journey ended here: http://en.wikipedia.org/wiki/Parsing_expression_grammar
i tried for a couple of hours to understand how to use the two js libraries that are listed there... but yeah, i have no expertise in that (PEG) field and couldn't find any newbie-friendly tutorials on the subject... so another frozen/dead project on my side.
.andre
ps. i just wanted to experiment with making a web gui for editing PKGBUILD files. you know, with validation, completion, "help bubbles", bells & whistles, etc...
On Mon 10 May 2010 09:23 +1000, Allan McRae wrote:
On 10/05/10 02:06, Loui Chang wrote:
On Sun 09 May 2010 16:21 +0200, Xavier Chantry wrote:
But I just had an idea now, if we're thinking about AUR use case : makepkg --source could generate a suitable and parsable file providing all information that AUR needs, and ships that next to the PKGBUILD in the source tarball. Does that sound crazy ? This would not fix the problem now, but it could fix it eventually, when most pkgbuilds are re-submitted. Or this parsable file could be generated for all pkgbuilds in a row, just for the conversion, in a chroot/jail on a machine not in production.
Yeah I've thought about this as well. Source packages could have a similar format as binary packages with a .PKGINFO file to present the metadata in an easily parsable format.
You can read some of my incomplete brainstormings here: http://louipc.mine.nu/arch/%5BRFC%5D-PKGINFO-in-srctargz
I am told I like to be really negative anytime this is bought up... it is not deliberate, I just see the barriers to this working. So here we go! I know you have pointed out some problems already and this is related.
No problem. I didn't really share this before because I hadn't even thought of a real solution. Since it was mentioned though, I thought I'd share my thoughts. There are definitely many barriers to sort out.
makepkg does not actually parse any of the splitpkg overrides until build time. How do we get the packaging variable overrides without actually making the package (and on every architecture)? We would need to extract the needed fields from the package functions somehow. So that brings us back to needing to hack a bash parser in makepkg or to actually require the package building to take place before you can create a source package. And this is not restricted to package splitting...
e.g.
pkgname=foo
...
# depends not needed at make time
# depends=('bar')
...
package() {
  depends=('bar')
}
Welcome to the world of makepkg hacks... And do not think such hacks are not used. The old klibc PKGBUILD generated a provides array in the build function on the basis of a file name only available at the end of the build process.
Yeah there'd have to be some kind of standard constructs for all these kinds of hacks like platform specific dependencies, etc. That would probably mean changing or expanding the PKGBUILD spec. I wouldn't be afraid to do that, but it might not sit well with compatibility or with Arch principles.
The joy of PKGBUILDs is that they are so flexible. The problem with PKGBUILDs is that they are so flexible.
Indeed.
participants (9)
- Allan McRae
- Andre "Osku" Schmidt
- C Anthony Risinger
- Kaiting Chen
- Loui Chang
- Matthew Monaco
- Pierre Chapuis
- vlad
- Xavier Chantry