So you want to automatically parse sentences without having to go through all the trouble of figuring it out for yourself? You’ve come to the right place. This brief tutorial is aimed at students who are interested in computer science and linguistics who maybe want to dip their feet in the water of computational linguistics without having to understand immediately all of the daunting details. In other words, what I wish I had two years ago before applying to graduate school.
What is Parsing?
Parsing can mean many things. In this case, it means taking a sentence of natural language (say English) and producing a list of parts of speech (noun, verb, etc) and arranging those parts into meaningful phrases. If you have studied linguistics at all, you already know this. If you haven’t studied linguistics, you probably know this without knowing it. Think about the following sentence:
The scared cat jumped over the confused dog.
You probably recognize the subject of the sentence as The scared cat and what the cat jumps over as the confused dog. These are noun phrases. The second noun phrase can be combined with the preposition over to form a prepositional phrase, which describes the act of jumping done by the cat. This combines with the verb to make a verb phrase. In computational linguistics, there are many applications for having this sort of knowledge about a sentence.
Now, we can do this automatically using a parser. Most parsers today are probabilistic, meaning they build up a statistical model from real world data (sentences written by humans). Many were trained from the Penn Treebank, which are 40,000 sentences from the Wall Street Journal and Brown Corpus that were annotated by a group of people at the University of Pennsylvania. By hand, they parsed these sentences in a consistent (mostly) format that could be read by a computer. Using this information about the structure of English, we can assign probabilities to certain grammatical rules and use those probabilities to guess which would be the best thing to do with sentences that we have never seen before.
Finding a Parser
There are many parsers out there. Most of the modern parsers perform about the same, some faster, some slightly more accurate. For the purposes of this tutorial we will focus on one freely available, platform independent, open source parser. The Stanford Parser is written in Java, so all you need to use it is right out of the box is Java 1.5 or higher. The parser is oriented towards being used in a linux environment, but it will work fine on a Windows or Mac machine assuming you have the correct version of Java.
Running the Parser
The Stanford Parser can be called from the command line to parse files of sentences, but it also comes with a GUI that lets you open a file or a web page. To run the GUI, go to the folder where you unzipped the parser and type the following line:
java -server -mx600m -cp “stanford-parser.jar” edu.stanford.nlp.parser.ui.Parser
Once the GUI opens, click the “Load Parser” button.

Navigate to the folder where you unzipped the parser and choose the file englishPCFG.ser.gz (any of the files ending in .ser.gz are parsers, so feel free to play around). The Stanford Parser supports multiple languages including Arabic, Chinese, English and German. The parser in this example is the English Probabilistic Context-Free Grammar parser. Explaining just what that means goes beyond the scope of this little how-to, which focuses simply on getting you to the point where you can produce trees.
Once you have chosen the parser, click the “Load File” button. You can enter the URL of a web page you want to parse or enter your own file. Or instead you can just start typing in the top box. Select the sentence you want to parse and click the “Parse” button. Your parse tree will appear in the bottom box.

Evaluating the Parser
This part is a little bit trickier. Let’s say you need to find out how well the parser is doing. One common tool for determining this is called evalb. The reason this is tricky is because you need a set of test sentences that are already parsed according to the formalism used by your parser. The parses output by the Stanford Parser look like this:
(ROOT
(S
(ADVP (RB Now))
(VP (VB go)
(VP (VB subscribe)
(PP (TO to)
(NP (PRP$ my) (NN feed)))))
(. .)))
So you have to hand-annotate some number of sentences yourself. I’d recommend you take a look at the manual for the Penn Treebank and the tagging guide. Nothing is ever easy, right? So once you have hand-coded annotations and the output from the Stanford Parser, you will need to create two files. The gold standard file contains your hand-coded annotations. The test file contains the output of the parser. You have to remove all the line breaks from the files so that there is only one parse per line. Then you can run evalb using the following command:
evalb -p new.prm gold.txt test.txt
I know this works for Linux. I think it works for Windows as well, but you might have some trouble compiling it.
The output of the evaluation tool will tell you things like bracketing precision, bracketing recall, bracketing f-measure, number of crossing brackets and tagging accuracy. Check out the Wikipedia article on Information Retrieval for information about precision, recall, and f-measures. What is being evaluated here is how often the brackets predicted match what the brackets actually should be. Tagging accuracy is the accuracy of the parser in choosing the correct part of speech for each word.
More on Trees
Another tool you can use to create trees from the bracketed output of a parser (perhaps to help you write your gold standard files for evalb) is the phpSyntaxTree project. While you can’t use the parantheses from the parser output, it’s an easy thing to substitute square brackets. The output is a nice, clean tree.
[S [NP [DT The] [NN output]] [VP [VBZ is] [NP [DT a] [ADJP [JJ nice] [, ,] [JJ clean]] [NN tree]]] [. .]]

Final Notes
I have gone through the trouble of posting this tutorial because I hope I eventually help someone trying to explore the interesting world of parsing. If you find any parts of the tutorial confusing, wrong, outdated, or have any other questions, please leave me a comment. I expect I have made at least a few mistakes in my description. Please let me know, no matter how minute.




4 comments
Comments feed for this article
2 August 2008 at 08:28:44
Hok
Hi- I’m glad that I’ve found this post, because it seems very helpful for my experimental chatterbot project- it’s called ANNA, written in Prolog.
The plan is, I need it to be adaptive.
It currently would take Natural Language as a ’sense’- i.e. the only stimuli she’ll ever receive are strings from the keyboard. Therefore, she would have to start off with quite a lot of innate knowledge about the English language and how to interpret it correctly and assimilate, or reject incoming propositions, keeping, all the while, a consistent model of the world within her knowledge base.
At the moment, I am in need of a way of tagging parts of speech, so that she could identify just what a sentence ‘means’ rather than simply extracting word after word and rearranging them.
But, for her to be fully ‘adaptive’, it is important that the parser, as well as the ’semantics extraction’ part (after POST), also expressed in Prolog. She must be able to reason about the way she does things, and she can, only by using Prolog’s meta-programming facilities. So, I wonder if you know of any good Prolog-based POST libraries that require little or no external files, and yet provide a good English PCFG parser?
I fear I may well have to roll my own, at the expense of a lot of time and re-inventing the wheel.
Cheers!
2 August 2008 at 08:39:04
Jason Adams
I don’t have first-hand experience with anything, since the only prolog I’ve written was a homework assignment in an AI class. A quick hunt on the Googles finds a couple things though, that may be of interest:
Pronto: http://www.ai.uga.edu/mc/ProNTo/
Book on NLP with Perl and Prolog: http://www.cs.lth.se/home/Pierre_Nugues/ilppp/slides.shtml
Anyhow, good luck. Might want to try asking on the comp.ai.nat-lang newsgroup: http://groups.google.com/group/comp.ai.nat-lang/topics?hl=en
6 November 2008 at 19:51:50
shaun
I am surprised at the limitations created by parsers generally. They seem to be merely scratching the surface of language, almost as if we are approaching things from the wrong perspective. We probably need some Einstein to come up with an entirely new way of looking at the issue of parsing.
6 November 2008 at 22:32:15
Jason Adams
I’m not sure exactly what you mean by the “limitations created by parsers.” Do you mean their inability to recognize certain things?
Statistical parsers are inherently flawed, though. In order to be computationally tractable you have to make a number of assumptions that are just plain wrong. The trick might be to figure out a way of incorporating world knowledge without blowing up the computation and taking forever to build. Or maybe it will take a completely different approach to solve… Who knows? :)