Archive for parsing

Parsing and stuff

Posted in Uncategorized with tags , , on December 1, 2008 by Bartosz Radaczyński

So, for the past two weeks or so I’ve been trying to get this small python thingy up and running. But (as they always do) this “small thingy” suddenly turned into piles and piles of code. I guess that this is what they mean, when they make out rules like “when a programmer gives you an estimate, add 1 and take the unit immediately larger than the one given” (so that two weeks means actually three months).

So, anyway, this little project started off as a code analyser for cobol programs + db2 sql. It was meant to provide some sort of data flow analysis. I all seemed pretty straightforward and the idea of making that kind of analysis on 300 programs blew my mind, so I figured that an automated tool would do a much better job than I even would doing it by hand. I sort of wandered around to see what choices are there to make your own custom parsers in python. As it turns out there are at least two good ones there. The first one is called pyparsing. This is the one I started off with. But after carefully converting the COBOL grammar from EBNF to pyparsing model it turned out that parsing just one program took like forever to complete. On the other hand it turned out to work pretty well on the sql, but after a while I decided to throw that out and reimplement it… I know, worst idea ever, but still, I was not much of a time-constrained, so I could afford that. And mind you, I’ve thrown away roughly three days of work, so not much harm was done there.

On the second take to the parsing issue I thought that actually being able to write the grammar as EBNF, since these are really much more readable than the pyparsing representation and they are also easier to change. After all, we cs guys are used to math-like symbols… So, with the application od simpleparse things really took off now. It took me about 10 workdays to get the cobol grammar to parse, some 2 more days to add the db2 sql (maybe not complete, but good enough for the programs here). So, anyway, the main thing was, that simpleparse is really a simple parser thingy. It does not support maximum length/most successful match but the first match only. This is crucial to defining grammar, you’ve gotta make the grammar list the longest expression first. The main problem was in the relational conditions, which in COBOL make the form of

IF ABC=1 OR 2 OR 3  OR DEF <= 123 AND WS-SOME-VAR IS NOT GREATER THAN ‘ABC’

now this is really strange to parse, especially the abbreviated condition (ABC=1 OR 2 OR 3, which actually means ABC=1 OR ABC=2 OR ABC=3). But you can get by somehow – at least I did. Anyway, the performance increase is dramatical. On my dual-core laptop the pyparsing stuff took two days to parse a simple program! With simpleparse it takes several seconds… Well naturally this is due to the parser’s implementation being way simpler (first match cuts the further comparisons), but if you’re carefull enough this thingy is capable of doing soooooooo much!

So in the end I guess that Steve Yegge was rigth when telling to learn that stuff about compilers. It definitely pays off to be aware that it is easier to make a parser that use regexes… Or at least it seems so 😉

Advertisements