You are currently browsing the category archive for the 'computational linguistics' category.
Mind42 is a web-based mind mapping app that I used about six months ago and forgot about. They just recently ended their so-called beta stage, which reminded me that they existed. After giving them another spin, I think I’m going to start using mind maps more often. I also used FreeMind for a while, but the collaborative features of mind42 are more appealing to me right now.
A mind map is a graphical representation of the relationship between concepts. You start with a central concept and build outwards. The way you organize the concepts is up to you and good mind mapping software will let you completely restructure the graph easily. For example, I created a mind map to keep track of all the places I had applied for jobs. First, I organized it by city and state. This is good for keeping track of where we might be moving, but it doesn’t really capture what I think about the jobs I’m looking at. Some jobs I’m looking at deal with summarization, some are for government contractors, some are linguistics-oriented, etc. By arranging them by some other sort of topic, I can visualize the options available to me. Also, the act of organization helps me think through the choices I am facing. The more you think about the issues, the better equipped you will be to make a decision. That doesn’t necessarily mean you will make the best decision..
I contend that mind maps are also good for paper writing. They offer ways of teasing out information and the relationship between the ideas you are trying to convey. Of course, your mileage may vary. Not everyone is helped by mind maps. But if you are, try using them the next time you have a paper to write.
One drawback to mind maps are the time involved. You can spend a lot of time constructing a mind map and reorganizing the ideas. There is a trade off between the time spent organizing your thoughts and the value or length of the task. If it is very important for you to have thought through something in detail, a mind map is probably the right choice. If you are just jotting down thoughts in the form of a blog post, they may be too time-consuming. I doubt your readers will mind if you spend that kind of time, though.
And no, I did not take my own advice in using them for this blog post. Maybe next time.
Here’s an exciting potential computational linguistics task: construct a mind map automatically from text. I can think of a few ways to get started…
So I am on the market after getting my masters. I’ve posted my resume to Dice and Monster and a couple others. Monster gets the most unsolicited calls. I’m finding that recruiters are an odd lot. There are some who are pleasant, though to a man (or woman) they’ve never heard of NLP or computational linguistics and have no idea how to help me (with the exception of the one or two recruiters I’ve contacted for NLP jobs). For the most part, they seem to not even read my resume. Oh, you have Java skills? How about this Java grunt job that only requires a bachelor’s degree? Waste of time. The best are the ones who contact me in broken English with a multitude of typos. Yeah, right.
I have been told that with my CMU degree, I should be looking exclusively at the big corporations: Google, Microsoft, Amazon, Yahoo, etc. If I do my time there, I can get a job anywhere and have a good career. That’s true, I’m sure. Something about startups is really attractive to me, though, so I’ve been looking at a lot of them. What if the only job I can get at a Googlosoftazonahoo is not NLP-related? Everything is so rushed. I have a September 1st exit date for CMU and I want to be in the city of my chosen job by then. Add lease problems. The problem is my decision to abandon academia didn’t come at the right time: back in the winter. I am, however, more confident than ever that it was the right decision.
This is research I did a while ago and presented Monday to fulfill the requirements of my Masters degree. The presentation only needed to be about 20 minutes, so it was a very short intro. We have moved on since then, so when I say future work, I really mean future work. The post is rather lengthy, so I have moved the main content below the jump.
I just completed the final requirements of my Masters degree today (the details of which I will save for a future post). It has been a difficult road, and I’m glad it’s done. I didn’t attend any sort of graduation ceremonies, because I don’t go for that sort of thing — at all. Until today, it didn’t feel like the weight was off my shoulders. Now I actually feel like celebrating! But I won’t, because I’m a nerd. I’m currently celebrating by working on a programming puzzle. And surfing the blagoblag.
I still have a couple months of servitude to complete the requirements of my fellowship, but the degree is mine.
The North American Computational Linguistics Olympiad is an annual competition open to US high school students that introduces kids to computational linguistics at a much younger age than people normally hear about it. I didn’t hear about CL until I was three years into my undergrad program. The instant I did hear about it, I knew I wanted to do it. Most people I talk to about it, look like I’ve just uttered a phrase of Klingon. I suspect most people don’t hear about it at all, or if they do, it’s sometime during their undergrad program and not at the beginning, when they might be better able to plan their educational career path. Also, CL is pretty much a graduate program and rarely taught before then. Granted, a lot of the maths involved are beyond what’s taught to high school students and early undergrads, but the linguistics is not. And thinking about linguistics computationally is not. So NACLO is doing an extremely valuable service which I support completely. And not just because one of my professors is one of the General Chairs of the organizing committee for it. She no longer can affect my grade and I have no need to suck up — so this is genuine. How’s that for full disclosure?
One of my google alerts popped up a post on a spam blog I tracked down to this original post, which talks about a lot of young kids doing some great things in science. In the post is an interview with last year’s winner, Adam Hesterberg. He said, “I’d never studied linguistics, and ‘computation’ sounded like boring calculation.” That reminded me of the fact that computation might mean a different thing for most people than it does for scientists. I’m no corpus linguist, so I’m not gonna try to find out right here. What I suspect is that computation has a more “hard work” connotation for people outside of science: it’s the “plugging and chugging” meaning. Inside science, it’s tacked onto the beginning of some other field to mean anything in that field that can be computed. Computational linguistics deals with the computable aspects of linguistic theories. A very quick search on wikipedia finds at least a dozen other computational fields:
- Computational biology
- Computational chemistry
- Computational economics
- Computational electromagnetics
- Computational engineering
- Computational finance
- Computational fluid dynamics
- Computational mathematics
- Computational mechanics
- Computational particle physics
- Computational physics
- Computational statistics
Is it a good idea to use this name when approaching high school students? What about language technologies? Well, the competition isn’t about language technologies, it’s about critical problem solving in a linguistics setting. And trying to fit that into a competition name isn’t going to work, either. North American Critical Problem Solving about Linguistics Olympiad (NACPSLO)? It makes me think of narcolepsy.
So my proposal is North American Logic and Language Olympiad (NALLO). It’s easy to say (rhymes with hallow) and accurately describes the subject matter. Plus, I think it has broader appeal. A lot of kids are interested in logic, language, or both. It shakes free of the negative connotation of computation and draws kids where they can be introduced to it a little more easily. The downside is that it doesn’t mention linguistics directly, so that might trouble some people who are a little more traditional about their outreach.
What do you think?
If you follow news on the semantic web or new search engines, you may have heard of hakia. TechCrunch has done a small write up about their new semantic search API. TechCrunch is brutally hard on startups who aren’t fully operational, so there is a lot of criticism in that article that I take with a grain of salt. I like seeing startups open their services with APIs and I think they deserve some benefit of the doubt. Maybe I’m looking at it the wrong way, though, and the fact that TechCrunch does make such a stink ensures the startup will correct the problem asap, rather than farting around for a while.
So I decided to finally fart around with OpenCalais a little. There’s a nice video on the site that gives you an impression of what it is capable of, but it’s also like all videos about software: propaganda. Calais is basically Named Entity Recognition (NER) software that can be accessed via a web API. Whereas a regular NER system might recognize named entities like people, organizations, and places, Calais also recognizes relationships like corporate acquisitions. To be a little more clear if you aren’t familiar with NER, it is basically the task of identifying the proper nouns in a body of text. Named entities aren’t always proper nouns, but that is one starting point. Examples would be: John Hancock (Person), New York (Place), and Apple (Organization). Calais recognizes relationships, which means we get an extra layer of information: Acquisition(Microsoft, Yahoo!).
Calais is put out by Reuters which has a long history of helping out the NLP and IR research communities with data sets. Being Reuters, the data sets are all newswire stuff, and Calais is produced in that spirit. Currently the relationships and named entities available reflect that bias, but the list is expanding and it is probably flexible enough for most domains. Their claim is that with each new release, there will be additional entities and relationships available. Also, the software is completely open source free for commercial and private use. For this, I give Reuters props.
OpenCalais uses SOAP or HTTP post to issue requests and you can take a look at their tutorials for exactly how to use it. After some very shallow digging on the googles, I found an open source project called python-calais, which is basically just a script that wraps some text and sends it to the Calais service, then processes the output. The output is in RDF (resource description framework), which is a type of xml document that is not very friendly to the human eye but is nice and powerful otherwise. The python-calais script uses an rdf library for python, so you’ll need to download that if you don’t already have it.
Running it on my most popular post, you get the following output:
93B6642D-0D7C-37Ab-A92F-66Ebfef13C8D :: Recommender Systems (Industryterm)
0Dccb106-442A-3848-Bd0B-A388E73F4C8C :: Chris Sternal-Johnson (Person)
Aab0D16A-Ad5A-348A-A8Dc-58Cf59A1Bc15 :: Kristina Tikhova (Person)
42F476A0-2Fae-3F36-808D-803E4F620Ab0 :: Java (Technology)
6C4Cd5D9-5866-35B5-81Ab-B8A5C1751A44 :: Pre-Processing Phase (Industryterm)
4003D863-C7A6-3E6F-8E3C-0913Bf2F8242 :: National Aeronautics And Space Administration (Organization)
77D1Ceb3-9900-3Dd7-8351-F29408B21412 :: Carnegie Mellon University (Organization)
Ee58Ef4B-1C98-3F8B-Aff8-3Fd6E3D76A9E :: Wonderful Site (Industryterm)
8F12E551-A8F1-3705-866C-D44D1A6A54F4 :: Richard M. Hogg (Person)
Adee23De-B1B0-37Ad-9E20-1Fa8094F6D39 :: Steel (Industryterm)
0Ace00C6-2B9F-32C2-8949-82A0F6C6B444 :: Xml (Technology)
2Ed2F085-1C63-324E-B518-60332388E273 :: Norman French (Person)
136157D8-D62E-3C55-Ae67-3Ec182C2C703 :: Phil Barthram (Person)
B6A8Dbfa-Fd35-32Bb-9E05-A2811C480000 :: Mike Tan (Person)
Ed8B5Fe4-616A-36Ea-8C47-3Eea7C71Aee0 :: Ben Eastaugh (Person)
D3Bcba58-00Fc-33C5-9346-Dbf6A2441867 :: Machine Learning (Technology)
F17C3779-3810-3Ff9-A42D-75C3137F0F7F :: Modern English (Person)
38116E8D-F8B4-3D03-B0Ad-C9A24B888E61 :: Jason M. Adams (Person)
4386B07C-F6B8-3991-Af74-Ab11A951F0Ee :: David Petar Novakovic (Person)
Aa14303F-F9F0-31B8-Adff-3B9C68E0A9F1 :: Language Technologies Institute (Organization)
Ca1E4Eb7-7820-3862-8443-26E37B33E13F :: Machine Translation (Technology)
As it picks up everything on the page, there is a lot included there that isn’t related to the post about Old English translation. Also, it picks up some weird so-called industry terms like “steel.” If you filter out just the text (manually), the output is a little more sensible:
6C4Cd5D9-5866-35B5-81Ab-B8A5C1751A44 :: Pre-Processing Phase (Industryterm)
Ca1E4Eb7-7820-3862-8443-26E37B33E13F :: Machine Translation (Technology)
0Ace00C6-2B9F-32C2-8949-82A0F6C6B444 :: Xml (Technology)
2Ed2F085-1C63-324E-B518-60332388E273 :: Norman French (Person)
136157D8-D62E-3C55-Ae67-3Ec182C2C703 :: Phil Barthram (Person)
F17C3779-3810-3Ff9-A42D-75C3137F0F7F :: Modern English (Person)
(The codes are unique identifiers.) Unfortunately, some important terms are still missed, like Old English. So it appears Calais has some growing to do, but it’s off to a good start. Part of the problem might be that that blog post is out of domain. I imagine with time, it will continue to improve. We’ll see.
The standard way of doing human evaluations of machine translation (MT) quality for the past few years has been to have human judges grade each sentence of MT output against a reference translation on measures of adequacy and fluency. Adequacy is the level at which the translation conveys the information contained in the original (source language) sentence. Fluency is the level at which the translation conforms to the standards of the target language (in most cases, English). The judges give each sentence a score for both in the range of 1-5, similar to a movie rating. It became apparent early on that not even humans correlate well with each other. One judge may be sparing with the number of 5’s he gives out, while another may give them freely. The same problem crops up in recommender systems, which I have talked about in the past.
It matters how well judges can score MT output, because that is the evaluation standard by which automatic metrics for MT evaluation are judged. The better an MT metric correlates with how human judges would rate sentences, the better. This not only helps properly gauge the quality of one MT system over another, it drives improvements in MT systems. If judges don’t correlate well with each other, how can we expect automatic methods to correlate well with them? The standard practice now is to normalize the judges’ scores in order to help remove some of the bias in the way each judge uses the rating scale.
Vilar et al. (2007) propose a new way of handling human assessments of MT quality: binary system comparisons. Instead of giving a rating on a scale of 1-5, they propose that judges compare the output from two MT systems and simply state which is better. The definition of what constitutes “better” is left vague, but judges are instructed not to specifically look for adequacy or fluency. By mixing up the sentences so that one judge is not judging the output of the same system (which could introduce additional bias), this method should simplify the task of evaluating MT quality while leading to better intercoder agreement.
The results were favorable and the advantages of this method seem to outweigh the fact that it requires more comparisons than the previous method required ratings. The total number of ratings for the previous method was two per sentence: O(n), where n is the number of systems (the number of sentences is constant). Binary system comparisons requires more ratings because the systems have to be ordered: O(log n!). In most MT comparison campaigns the difference is negligible, but it becomes increasingly pronounced as n increases.
What would be interesting to me is a movie recommendation system that asks you a similar question: which do you like better? Of course, this means more work for you. The standard approaches for collaborative filtering would have to change. For example, doing singular value decomposition on a matrix of ratings would no longer be possible when all you have are comparisons between movies. Also, people will still disagree with themselves (in theory). You might say National Treasure was better than Star Trek VI, which was better than Indiana Jones and the Last Crusade, which was better than National Treasure. You’d have to find some way to deal with cycles like this (ignoring it is one way).
References
Vilar, D., G. Leusch, H. Ney, and R. E. Banchs. 2007. Human Evaluation of Machine Translation Through Binary System Comparisons. In Proceedings of the Second Workshop on Statistical Machine Translation. 96-103. [pdf]
NLP app idea: construct random songs by scraping lyrics websites and stringing together common phrases. It’s a Pandora night for me and here were a couple lyrics that struck me as particularly meaningful. Both by Regina Spektor, introduced to me by Pandora before she became (semi-)famous.
And then you take that love you made
And stick it into some
Someone else’s heart
Pumping someone else’s blood
- “On the Radio”
Beneath the stars came fallin’ on our heads
But they’re just old light, they’re just old light
- “Samson”
I love how she takes the beautiful image of stars falling on their heads and strips it bare of all romanticism and attached meanings, exposing them for what they are: old light.
I was asked recently about the motivation for Abney’s DP (determiner phrase) hypothesis. That is, that determiners are not part of English noun phrases but head up their own phrases of which NPs are complements. I couldn’t remember the justification I was given in my Syntax I class, so I went back to the textbook (Syntax: A Generative Introduction by Andrew Carnie). I found the following interesting excerpt:
“… for lack of a better place to put them, we put determiners … in the specifiers of NPs. This, however, violates one of the basic principles underlying X-bar theory: All non-head material must be phrasal. Notice that this principle is a theoretical rather than an empirical requirement (i.e., it is motivated by the elegance of the theory and not by any data), but it is a nice idea from a mathematical point of view, and it would be good if we could show that it has some empirical basis.”
This clashes a bit with my empirical sensibilities. It represents very much the rational point of view in linguistics, that we can probe our own understanding of language by judging what we perceive to be grammatical or ungrammatical. The empiricist view would look at it from another angle: does it appear in data? So the theoretical view might be “nice” but if it is not supported by the data, it is crap.
Treebanks don’t use DPs (at least none that I’ve seen), so automatic parsers typically have no concept of them. I wonder if they would add any value? I’m guessing they would just run into sparsity issues since another set of tags have to be estimated. But who knows, the extra structure might be helpful in complex situations.
My blogarrhea has tapered off over the past few weeks due to a number of time constraints, stress, and anxiety. The wikipedia article for graphomania actually suggests blogging may be a form of graphomania. Currently, I’m visiting the University of Maryland tomorrow to take a look at the (cs) grad school and the CLIP (Computational Linguistics and Information Processing) laboratory. So maybe I won’t blog anything for a couple days or maybe not.
Anyhow, here are two views out of my hotel room of illustrious Silver Springs, Maryland. The direct view out is the roof of a Whole Foods.


So I was recently asked (and gave a very bad answer to) a question that has been haunting me ever since. What is the subfield of computer science where I am the strongest? First of all, in my undergraduate training, I was never really introduced to these ideas of subfields of CS explicitly. I knew intuitively there was a difference between people working on databases or on operating systems, programming languages or algorithms, but it wasn’t emphasized as a choice I would ever need to make. This is perhaps because I went to a relatively weak school in CS for my undergrad. But now that I’m in a rather strong CS school and pursuing a CS-related masters, the question should probably have entered my mind before now.
So when asked, I floundered about for an idea and spluttered out “algorithms” just because it seemed like it was hard to go wrong there. Well, I’ll leave the details out of this little memoire, but suffice it to say, I was wrong. A better answer would have been “none.” Where does natural language processing / computational linguistics fall in the list of subfields? Is it its own? Or is it part artificial intelligence, part algorithms, part whatever? I’ve seen it lumped with AI more closely in the past, but unfortunately AI escaped me as a possible choice when called upon in this high-stress scenario. Moreover, I haven’t really compartmentalized techniques as belonging to “AI” or “databases.” Is it useful to do that? I guess I do sometimes, but when people ask me to make big picture assessments of things I haven’t thought about much, it takes me a while to process it.
I hate interviews.
Systran is one of the oldest companies around that provide machine translation software. They power some language-pairs of Microsoft’s translation service, Altavista’s Babelfish, and quite a few others (including, until recently, Google). In the past, their software has been rule-based, so translation is done with a bilingual dictionary and a set of rules of how to change text from one language into another. Based on a recent bevy of jobs postings on Linguist List, it appears they are going statistical. Maybe they have been for a while, I don’t know, since I don’t actually follow what they do, but this piqued my interest.
If your interest is piqued too, the listings are for:
And, of course, salary ranges are not provided.
Stepping back in time in MT Eval from my last post, Liu and Gildea (2005) were among the first to really bring syntactic information to evaluating machine translation output. They proposed three metrics for evaluating machine hypotheses: the subtree metric (STM), the tree kernel metric (TKM), and the headword chain metric (HWCM). STM and TKM also had variants for dependency trees, which HWCM relies on. Owczarzak et al. (2007) extended HWCM from dependency parses to LFG parses. HWCM has attracted more attention since it showed better correlation at the sentence level than either STM and TKM (both versions) and outperformed BLEU on longer n-grams. It’s interesting to note, though, that the dependency-based tree kernel metric performed best of all at the corpus level. Sentence level granularity is typically more important for helping you tune your MT system.
The subtree metric is a fairly straightforward idea. You begin by parsing both the hypothesis and the reference sentences using a parser like Charniak or Collins to get a Penn TreeBank style phrase structure tree. You then compare all subtrees in the hypothesis to the reference trees, thresholding the number of matches by the best match in the reference trees. The formula is given below:

The tree kernel metric uses convolution kernels discussed by Collins and Duffy (2001). For the specifics of this method, I refer you to the respective papers (and I may post on it at a later date), but the general idea is that you can transform structured data (a tree) into a feature vector by using the kernel trick. Finding all subtrees of a tree can be exponential in the size of the sentence, which would make computation infeasible for large sentences. The kernel trick lets us operate in this exponentially-high-dimensional space with a polynomial time algorithm. Once we have constructed the feature vectors for the hypothesis and refernece trees, we can score them with their cosine similarity:

H(T1) and H(T2) are vectors with non-zero values for subtrees (dimensions) that appear in each tree, so the dot product of the two is the number of subtrees in common. The score is computed as the maximum cosine similarity between the hypothesis and the references.
Finally, the headword chain metric (HWCM) relies on dependency parses, which I touched on in my previous post.
In dependency grammars, a tree is built by linking a word to its head. So a determiner would be linked to the noun it modifies, the direct object would be linked to the verb, etc. Each link of this sort is a headword chain of length 2. As you build up the tree, you can construct longer and longer headword chains.
The HWCM score is calculated just like the STM except by comparing headword chains. The difference between the HWCM and the dependency version of the STM is that STM considers all subtrees whereas HWCM only looks at direct mother-daughter relations (no cousins or sisters).
References
Michael Collins and Nigel Duffy. 2001. Convolution kernels for natural language. In Advances in Neural Information Processing Systems.
Ding Liu and Daniel Gildea. 2005. Syntactic Features for Evaluation of Machine Translation. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization at the Association for Computational Linguistics Conference 2005, Ann Arbor, Michigan.
Karolina Owczarzak, Josef van Genabith, and Andy Way. 2007. Labelled Dependencies in Machine Translation Evaluation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 104-111, Prague, June 2007.
Since Papineni et al. (2002) introduced the BLEU metric for machine translation evaluation, string matching functions have dominated the field. These metrics work well enough, but there are cases where they break down and more and more research is revealing their biases. Also, BLEU does not correlate especially well with human judgments, so the quality of MT would benefit from a metric that better captures what makes a good translation.
A recent trend in this direction has been to introduce linguistic information in MT eval. Liu and Gildea (2005) used unlabeled dependency trees to extract headword chains from machine and reference translations to evaluate MT output. To define a few terms, reference translations are human translations that machine translations are compared to during evaluation. In dependency grammars, a tree is built by linking a word to its head. So a determiner would be linked to the noun it modifies, the direct object would be linked to the verb, etc. Each link of this sort is a headword chain of length 2. As you build up the tree, you can construct longer and longer headword chains. Liu and Gildea compared the headword chains constructed for both machine and reference translations and produced a metric based on comparing the two sets of headword chains. These chains were not annotated with any sort of grammatical relation (subject, object, etc), so they are unlabeled dependencies.
Owczarzak et al. (2007) have extended the work by Liu and Gildea (2005) using labeled dependencies. They parsed the pairs of sentences with a Lexical Functional Grammar (LFG) parser by Cahill et al (2004). In LFG, there are two components of every parse: a c-structure (i.e. a parse tree) and an f-structure, which describes the features of the lexical items. An example of an LFG parse from their paper is given below. F-structures are recursive structures with a head containing all of its constituents. From the f-structure it is easy to construct dependency trees. The bonus is that the f-structure provides the grammatical relations between items in the dependency trees. In the example below, the dependency subj(resign, john) has the grammatical relation of subject. That is, John is the subject of the sentence headed by the verb resigned.

Their metric is then simply a comparison of these labeled dependency headword chains using precision and recall to compute the f-score (harmonic mean). One of the coolest things in the paper is how they handle parser noise. Statistical parsers are not perfect. They estimate probabilities for rules from labeled data. In natural language, variation is pretty much unlimited, so no matter how big the training corpus, there will always be things the parser has never seen before. Also, we are dealing with imperfect input (by the MT systems or humans) so the problem of noise could be even greater. They address this by running 100 sentences through the various MT metrics they are comparing (including their own) as both the reference machine translation. This produces the “perfect score” for each metric since they are identical. Next, adjuncts are rearranged in the sentence so that the resulting meaning has not been changed, but the structure has. Each MT metric now evaluates the new sentence compared to the original and computes a score. For the LFG parse, the f-structure should remain the same in both cases, so any divergence can be attributed to parser noise. In order to this noise, they used the n-best parses and were able to increase the f-score, bringing it closer to the baesline (ideal). So instead of just comparing the best parse for the reference and machine translation, they combine the n-best parses to compute the f-score.
The result is that they get correlations with human judgments competitive with the best system they compare themselves to (METEOR, Banerjee and Lavie, 2005), beating it for fluency and coming in a close second overall. As far as future work goes, there are quite a few extensions they mention in the paper. The LFG parser produces 32 different types of grammatical relations. In the current setup, they are all weighted the same, but they would like to try tuning the weights to see how that affects the score. Another extension they propose is using paraphrases derived from a parallel corpus. There has been other work done on paraphrasing for MT evaluation (notably Russo-Lassner et al., 2005). One thing I am curious about is whether changing the weight on the harmonic mean would have an impact on correlation. METEOR uses the F9-score while the typical thing to do is F1. It’s not clear that weighting precision and recall equally is the best thing to do.
Interesting stuff, though. I hope they continue the work and maybe we’ll see something in this year’s ACL.
Update
Karolina Owczarzak has confirmed they were using the F1 score and that different F-scores did not lead to significant improvements. I also added the image I forgot to include in the original post.
References
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at the Association for Computational Linguistics Conference 2005, pages 65-73, Ann Arbor, Michigan.
Aoife Cahill, Michael Burke, Ruth O’Donovan, Josef van Genabith, and Andy Way. 2004. Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), July 21-26, pages 320-327, Barcelona, Spain.
Ding Liu and Daniel Gildea. 2005. Syntactic Features for Evaluation of Machine Translation. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization at the Association for Computational Linguistics Conference 2005, Ann Arbor, Michigan.
Karolina Owczarzak, Josef van Genabith, and Andy Way. 2007. Labelled Dependencies in Machine Translation Evaluation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 104-111, Prague, June 2007.
Grazia Russo-Lassner, Jimmy Lin and Philip Resnik. 2005. A Paraphrase-based Approach to Machine Translation Evaluation. Technical Report LAMP-TR-125/CS-TR-4754/UMIACS-TR-2005-57, University of Maryland, College Park, Maryland.
OpenEphyra is a question answering (QA) system developed here at the Language Technologies Institute by Nico Schlaefer. He began his work at the University of Karlsruhe in Germany, but has since continued it at CMU and is currently a PhD student here. Since it is a home-grown language technologies package, I decided to check it out and play around. This is the first QA system I have used that wasn’t integrated in a search engine, so this isn’t exactly an expert review.
Getting started in Windows (or Linux or whatever) is pretty easy if you already have Apache ant and Java installed. Ant isn’t necessary, but I recommend getting it if you don’t have it already. It’s just handy. First, download the OpenEphyra package from sourceforge. The download is about 59 MB and once it’s done unpack it in whatever directory you want. Assuming you have ant installed, all you have to do is type ant to build it, though you may want to issue ant clean first. I had to make one slight change to the build.xml file to get it to run, which was on line 55: <jvmarg line="-server& #13;-Xms512m& #13;-Xmx1024m"/>, which had to be changed to <jvmarg line="-server -Xms512m -Xmx1024m"/>. Easy enough. Then to run it, all you have to do is type ant OpenEphyra.
After taking a short bit to load up, you can enter questions on the command line. Based on what I can tell from the output, it begins by normalizing the question (removing morphology, getting rid of punctuation). Then it determines the type of answer it is looking for, like a person’s name or a place and assigns certain properties to what it expects to find. Next it automatically creates a list of queries that are sent to the search engine(s). The documentation indicates that the AQUAINT, AQUAINT-2 and BLOG06 corpora are included (at least preprocessing is supported), but there are searchers for Google, Wikipedia, Yahoo and several others. Indri is a search engine which supports structured queries and OpenEphyra auto-generates some structured queries from what I saw playing around. After generating the queries, they are sent to the various searchers and results are obtained and scored. Finally, if you’re lucky, you get an answer to your question.
Here are the results of screwing around with it for a few minutes:
- Who created OpenEphyra?
- no answer (sorry, Nico)
- Who invented the cotton gin?
- Eli Whitney
- Who created man?
- God
- What is the capital of Mongolia?
- Ulaanbaatar
- Who invented the flux capacitor?
- Doc Brown (awesome!)
- Who is the author of the Mendicant Bug?
- Zuckerberg — damn you, Facebook! :(
- How much wood can a woodchuck chuck?
- no answer (correct)
- What is the atomic number of Curium?
- 96 (also correct)
- Who killed Lord Voldemort?
- Harry (correct, but partial)
- How many rings for elven kings?
- 3021 (so, so very wrong)
Fun stuff! It’s not anywhere near perfect, but there are definite uses and the thing is ridiculously easy to install and use. Also, it’s in Java, so you can integrate it with your own system with very little effort. Depending on what sort of question you are looking for answers to, you get various levels of results. Factual questions about geography and people tend to do better than questions about numbers and fiction, as you might expect. Also, why-questions are not supported.
Another bonus is the project is open source, so if you’re into QA, you can help develop it.
At ACL this year, the Third Workshop on Stastical Machine Translation will be held and they are featuring a shared task on MT evaluation. The shared task will involve evaluating output from the shared translation task, which will be released on March 24th, with short papers and rankings due on April 4th. I created an MT evaluation system (pdf) last year for a class (on MT, no less), though I doubt it would do particularly well. I outperformed BLEU, but fell short of METEOR. In any case, it might be interesting to play with the data and certainly will be interesting to read the papers. My system does perform sentence-level ranking as one of its primary goals, which is also a goal stated by the shared task.
This is the question I will have to answer over the next few weeks.
One of my classes this semester is the Advanced Machine Translation Seminar (and I hope that link works outside of CMU). Each of us who has registered for the class will present a certain topic in MT and then do a literature review about it by the end of the semester. Originally I had wanted to cover how word sense disambiguation (WSD) has been applied to statistical machine translation, but that overlapped with another topic on bringing in context to MT. In simple terms, WSD is just the task of figuring out which of the many definitions a word has applies in the given circumstances. WSD systems use the context around the word to determine its sense. Thus, it is just another way of bringing context into MT. We determined there was no clear way of separating the topics so that I could still do that, so since mine was the more specific it seemed reasonable to me that I should change topics. No one else is presenting on machine translation evaluation (MT Eval), so I opted for that.
MT Eval is actually a pretty vibrant topic at the moment. For some quick background, machine translation systems produce woefully inadequate translations much of the time. If you have any doubt of this, try to translate a random web page using any of the many free online services. You will get many disfluencies, untranslated words, downright gibberish, and much worse. Not all of it will be bad, of course, but much of it will be. It is a hard problem, and many MT researchers believe it to be AI-complete (the Wikipedia article mentions MT explicitly). In order to improve machine translation, you need some way to automatically evaluate how well you are doing. Currently this is done using automatic metrics that compare machine output to (usually multiple) human translations (aka reference translations). The most commonly used metric is BLEU (pdf), but a rising star is METEOR, developed in part by one of my professors. I won’t go into these metrics any further here at the moment, and I recommend interested parties check out the papers. What these metrics aim to do is gauge how similar the machine output is to the reference translation(s).
The problem with MT Eval is that in order to be able to automatically tell whether something is a good translation, we would have to know exactly what goes into making a good translation (and by good I mean human-level). If we could do that, we would have solved MT!
More to come.
In previous posts on cognate identification, I discussed the difference between strict and loose cognates. Loose cognates are words in two languages that have the same or similar written forms. I also described how approaches to cognate identification tend to differ based on whether the data being used is plain text or phonetic transcriptions. The type of data informs the methods. With plain text data, it is difficult to extract phonological information about the language so approaches in the past have largely been about string matching. I will discuss some of the approaches that have been taken below the jump. In my next posting, when I get around to it, I will begin looking at some of the phonetic methods that have been applied to the task.
FSMNLP 2008 (Finite State Methods and Natural Language Processing) has issued their first Call for Papers (CFP). The deadline is May 11, 2008 and the conference will take place on September 11-12, 2008. Not the best time to be travelling perhaps, but this year it will be in Ispra, Lago Maggiore, Italy! That’s in the far north of Italy, right next to the Swiss border. From the pictures I’m finding on Google, it’s a gorgeous resort area.

The sorts of things they are interested in include:
- NLP applications and linguistic aspects of finite state methods
- Finite state models of language
- Practices for building lexical transducers for the world’s languages
- Specification and implementation of sets, relations, and multiplicities in NLP using finite state devices
- Machine learning of finite state models of natural language
- Finite state manipulation software
The special theme this year will be on high performance finite state systems in large scale NLP applications.
I am going to try really hard to get something together for it this year. I had a project last year that was potentially worth submitting, but I wasn’t able to get it done in time. Unfortunately, it has languished since then as other, more pressing matters have superceded it. Going to Northern Italy ought to be motivation enough, though, don’t you think?
The entire CFP is below the jump and is also available on their website:
In my previous post on cognate identification, I gave two definitions for cognates: strict and loose (orthographic). Strict cognates are words in two related languages that descended from the same word in the ancestor language. Loose cognates are words in two languages that are spelled or pronounced similarly (depending on the data consists of phonetic transcriptions or plain text). These two definitions help form the basis for how I choose to classify approaches to doing cognate identification, but the source of data is the bigger factor, in my opinion. The orthographic approach looks at plain text and attempts to do some sort of string matching or statistical correlation based on the written (typeset) characters of the language. The phonetic approach relies on phonetic transcriptions of words in the language. Phonetic transcriptions are usually done in the International Phonetic Alphabet (IPA) but any standard form of representing sounds will work. One such example is the Carnegie Mellon Pronouncing Dictionary. Phonetic approaches may use string matching techniques, but there are also a number of inductive methods based on phonology that have been tried to good effect.
So a good question might be why does the data being used matter so much to these techniques? Why not classify the two approaches as to whether they look for loose or strict cognates? Might there not be another way of classifying the approaches to cognate identification beyond these two? Or is there an entirely different set of classes that would better describe them? To answer the last two questions, I will say that there very well may be better ways of classifying these algorithms. As Anil pointed out in the comments to my last post, the two definitions lend themselves to different applications. From the papers that I read, it seemed that when researchers looked at plain text data, there was a completely different mindset than in papers where researchers used phonetic transcriptions. For the former, the goal was usually finding translational equivalences in bitext and for the latter the goal is more as an aid to linguists attempting to reconstruct dead languages or establish relationships between languages.
With plain text, it is very difficult to infer sound correspondences between two languages. In Old English, the orthography developed by scribes corresponded directly to the spoken form. As English changed over the 1000+ years since then, the orthographic forms of words have frozen in some cases and not in others. For example, the word knight was originally spelled cniht and the c and h were both pronounced. The divergence of orthographic and phonetic forms can result in any number of problems and so it influences the ways of thinking about the task. On the other hand, phonetic approaches suffer due to data scarcity. Obtaining phonetic transcriptions is expensive as it requires the effort of linguists or individuals with specific, extensive training in the area. There are ways of obtaining phonetic transcriptions automatically, but these methods are not perfect and so result in noisy data, making this data practically useless for historical linguists.
In my next post, I will go into orthographic approaches in more detail, describing some of the papers I looked at and the methods they used. After that, I will begin discussing phonetic approaches, which are more numerous. I will also begin to look at how machine learning is being used to tackle cognate identification.
View all posts on cognate identification.
I recently finished a literature review for my Language & Statistics 2 class. The topic was computational models of historical linguistics and my partner and I focused on cognate identification and phylogenetic inference. We split the work and my part was cognate identification. So I decided to blog about it for a bit and maybe someone out there will have something to offer. Granted, that won’t help my grade, but improving my understanding is more important. You can also check out our presentation.
First of all, to frame the problem, historical linguistics is a branch of linguistics that studies language change. Language can change in many ways, but the methods we looked at pretty much solely focused on phonological and semantic changes, with a few brief nods to syntactic change (on the phylogenetic inference side). The main tool used by historical linguists in reconstructing dead languages is the comparative method. This method looks at two languages suspected of being related and tries to infer the regular sound changes that led to the divergence. By examining lists of suspected cognates, they find sound correspondences — sounds that appear in similar contexts in both languages, but which aren’t necessarily the same phoneme. For example, the word for beaver in English and German derives from the Proto-Germanic word *bebru. In Old English, this became beofor (the f sounds like a /v/). In modern German, the word is Biber, with the /b/ phoneme preserved as it was in Proto-Germanic. So we could infer a sound correspondence between English /v/ and German /b/ in this context.
So what are cognates? If you have studied a second language, you no doubt have heard this term. I propose the following two classifications for cognates. A loose cognate will be a pair of words in two languages that is spelled or pronounced the same, with some minor variations. In this way, French resumé and English resumé would be considered cognates. Loose cognates have also been called orthographic cognates. A strict cognate is a pair of words in two related languages that descended from the same word in the ancestor language. Loan words are words that come into a language directly from another language, such as resumé. These words do not undergo the regular sound changes that are observed in strict cognates and so they are not considered cognates at all by historical linguists.
What is the effect the distinction between these two definitions would have on computational approaches to this task? I will look at this further in a future post, but feel free to post your thoughts in the comments.
The Call for Papers (CFP) is out for the Student Workshop at next year’s ACL (Association for Computational Linguistics) conference. I’ve been playing around with a couple of side ideas, it would be nice to have something to submit to this. We’ll see. Full CFP is below the jump.
So you want to automatically parse sentences without having to go through all the trouble of figuring it out for yourself? You’ve come to the right place. This brief tutorial is aimed at students who are interested in computer science and linguistics who maybe want to dip their feet in the water of computational linguistics without having to understand immediately all of the daunting details. In other words, what I wish I had two years ago before applying to graduate school.
Google announced that it has abandoned Systran as its translation system for the 22 languages it services besides Arabic, Chinese and Russian. Systran is one of the oldest machine translation companies around. When Microsoft launched its service recently, it announced that it would be supplementing its translations with Systran. Systran uses rule-based systems that have been massively tweaked to produce results that most would agree are still pretty crappy. They get some basic stuff right, but once you start venturing off into uncommon word usages and complex constructions, all bets are off. Some translation sites use Systran and others like freetranslation.com use their own system. Babel Fish is perhaps the most well-known site still using Systran.
So Google is switching over to its own statistical machine translation system for all 25 language pairs. Statistical machine translation systems typically look at two different kinds of text: aligned text in two languages (bitext) and monolingual text. The monolingual text is used to build a statistical model of the language so that output will conform to the target language rather than the original. For example, in German, the auxiliary verb comes in second position as in English, but the main verb often comes in final position. Reordering properly isn’t easy and this model helps make the output more natural. Bitexts are texts that have been translated from language to another and then aligned word-by-word. The actual alignment may be done by hand at the sentence level but the vast amount of human effort involved means that at the word level it is usually done automatically. Getting good alignments is an ongoing area of research that is quite far from perfect.
The thing that Google has going for it is that with statistical machine translation, the more data the better. And Google is overflowing with it. It’ll be interesting to see how their systems progress.
Well, the Call for Papers is out for ACL 2008 (Association for Computatonal Linguistics), which will be held in the city of my birth. Columbus, Ohio is such a short drive, it’d be a shame if I didn’t attend, even if I’m probably not submitting anything. The trick is getting someone else to pay for it!
Conference dates: 15-20 June 2008
Deadline for full papers: 10 January 2008
Deadline for short papers: 14 March 2008
The full Call for Papers is below the jump.
Accelovation has a job for a Computational Grammarian. They are looking for someone with a masters or PhD and the salary will be “commensurate with education and experience.” I would probably be qualified for this job (at least according to the short description) and even more so when I graduate next spring. I have at least worked with building LFG-style grammars in a couple classes and this is one particular area I find entertaining. Building grammars can be tedious at times, but it’s also like a giant puzzle with many sub-puzzles and you have to find a way to put them all together. Same with building finite state transducers.
But I’m always bothered by job postings that don’t list salary ranges. I don’t want to show up for a job interview and have them ask, “what were you thinking in terms of salary?” If I give them a number too low, they’ll think I don’t even think I’m worth much, so they shouldn’t hire me. If I give them a number too high, they’ll think I have a high opinion of myself and am not worth the asking price. I operate on the assumption that these are reasonable thoughts and not just remnants of social insecurity. Am I nuts?
I want some resource that I, as a job seeker, can use for free that will tell me what I can expect to be making at a certain job. When I’ve found the occasional resource, they have things like “Computer Analyst IV” and “Database Administrator III”. But nothing like “Computer Science Researcher” or “Computational Linguist” or “Language Technologies Researcher” or whatever. The closest I have found is Payscale, which let me put in “Computational Linguist”, but then asked me a few questions that basically amounted to “never heard of it.” Scientific Linguist was the closest thing, but not really accurate. When I tried “Scientific Researcher”, I got that there weren’t enough results to compare.
Blerg.
In a recent press release, kannuu is claiming to have revolutionized text entry. They claim that you can now perform text entry with just your thumb at the same speed of a regular keyboard. Too good to be true? Here is their method, complete with Hype™.
“Advancing text entry exponentially, kannuu’s powerful and precise Partial Word Completion® technology enables users with a fail-safe text entry solution. The kannuu application appears on device, as a four-point diamond shape, comprised of the most popular letters in the database it is indexing, with the center kannuu logo leading to the next set of choices.”
They registered a trademark on the phrase “partial word completion”?? Blerg. Not only do they have an über lame web 2.0 name in lowercase, they gotta stop people from marketing a similar technology under their oh-so-not-original name. Why does this make me so angry? Anyhow, I’m running off sideways on a rant that is pretty insignificant.
The real point here is the potential for coolness. So here is the technology: you enter a letter, it presents you with a “diamond” shape and the most common letters or group of letters that follow the letter(s) you just entered. In this way, most of your everyday phrases will be right up at the top of the list of things you’re presented so you could potentially be entering words with fewer keystrokes and all with very little thumb movement. This could really revolutionize key input and maybe bring pocket computers to reality [source].
So here is what I think the technology is based on. A very common technique in language technologies is the use of n-grams. So they use a character-based n-gram model to predict the most common letter or letters that you would type next based on some corpus. This isn’t anything new. Cell phones already have a T9 input method that guesses the most common word based on the single letters you choose. This isn’t all that different. If they have done the interface well, that could be a serious improvement.
If you’re interested in character-based n-gram models, I go into them in more depth after the jump.
Linguistic Issues in Language Technology (LiLT) is a new open-access journal in computational linguistics. The journal will focus on techniques that bring linguistics back into language technologies (LT). LT currently focus a lot on statistical techniques and sometimes can ignore linguistic insight altogether, but the field is beginning to swing around from the purely statistical approach to one that takes linguistic insight into account and merges it with statistical methods.
Curious about what sort of credibility this journal would have, I browsed the editorial staff and found some pretty big hitters. Following are some of the names that stood out to me. Christopher Manning of Stanford wrote the textbook used in my Language and Statistics class. Kemal Oflazer was one of my previous professors, who was visiting CMU last year. He’s done a lot of work with finite state transducers for morphological analysis of Turkish, among other things. Mark Liberman and Aravind Joshi of the University of Pennsylvania are pretty well known and accomplished. Aravind Joshi came up with Tree Adjoining Grammar and both he and Martin Kay won the ACL Lifetime Achievement Award. Mark Steedman is the current president of the ACL (Association for Computational Linguistics). Jason Eisner has done a lot of work on applying statistics to linguistics approaches and advised one of my current professors, Noah Smith. Philip Resnick has done a lot with word alignment and statistical machine translation.
The 9th annual Conference on Intelligent Text Processing and Computational Linguistics is calling for papers. The conference will be held in Haifa, Israel from February 17-23, 2008. Two of the keynote speakers were professors of mine last semester. Alon Lavie works on machine translation (including for low-resource languages) and machine translation evaluation. Kemal Oflazer was a visiting professor here at CMU last semester and advised me for a lab project on Old English morphology. He’s one of the leading proponents of finite state technology for morphological analysis and has done a lot of work with Turkish, which has a very rich morphology.
Submission Deadline: October 15, 2007.
The full CFP is given below:


