The North American Computational Linguistics Olympiad is an annual competition open to US high school students that introduces kids to computational linguistics at a much younger age than people normally hear about it. I didn’t hear about CL until I was three years into my undergrad program. The instant I did hear about it, I knew I wanted to do it. Most people I talk to about it look at me as if I’ve just uttered a phrase of Klingon. I suspect most people don’t hear about it at all, or if they do, it’s sometime during their undergrad program and not at the beginning, when they might be better able to plan their educational career path. Also, CL is pretty much a graduate program and rarely taught before then. Granted, a lot of the math involved is beyond what’s taught to high school students and early undergrads, but the linguistics is not. And thinking about linguistics computationally is not. So NACLO is doing an extremely valuable service which I support completely. And not just because one of my professors is one of the General Chairs of the organizing committee for it. She can no longer affect my grade and I have no need to suck up, so this is genuine. How’s that for full disclosure?

One of my Google Alerts turned up a post on a spam blog, which I traced back to this original post about a lot of young kids doing some great things in science. In the post is an interview with last year’s winner, Adam Hesterberg. He said, “I’d never studied linguistics, and ‘computation’ sounded like boring calculation.” That reminded me that computation might mean a different thing for most people than it does for scientists. I’m no corpus linguist, so I’m not gonna try to find out right here. What I suspect is that computation has a more “hard work” connotation for people outside of science: it’s the “plugging and chugging” meaning. Inside science, it’s tacked onto the beginning of some other field to mean anything in that field that can be computed. Computational linguistics deals with the computable aspects of linguistic theories. A very quick search on Wikipedia finds at least a dozen other computational fields.

Is it a good idea to use this name when approaching high school students? What about language technologies? Well, the competition isn’t about language technologies, it’s about critical problem solving in a linguistics setting. And trying to fit that into a competition name isn’t going to work, either. North American Critical Problem Solving about Linguistics Olympiad (NACPSLO)? It makes me think of narcolepsy.

So my proposal is North American Logic and Language Olympiad (NALLO). It’s easy to say (rhymes with hallow) and accurately describes the subject matter. Plus, I think it has broader appeal. A lot of kids are interested in logic, language, or both. It shakes free of the negative connotation of computation and draws kids where they can be introduced to it a little more easily. The downside is that it doesn’t mention linguistics directly, so that might trouble some people who are a little more traditional about their outreach.

What do you think?

I have been interested in alien (invented) languages since my first brush with Elvish in The Lord of the Rings. I checked out The Klingon Dictionary from the library in high school and currently own a copy of it and The Languages of Middle Earth. During high school, I nerdily amused myself by attempting to develop a language for Antarians, which involved gutturals and whistles. Speaking it myself was nearly impossible and I would occasionally practice, trying to go from a growling sound to a whistle as quickly as my human apparatus would permit. I imagine the average passerby might have considered calling the police to have me committed, or at least checked for rabies.

New Scientist has a brief article about the possibility of actually preparing for what alien languages might be like.  The argument that Terrence Deacon of UC Berkeley makes (according to the article) is that language serves a purpose.  It is a communication system for describing the world and since the world is in some way a fixed point of reference (though perception of the world is not), then abstract symbolism is a feature common to all languages.

At one point, the study of xenolinguistics would have been a dream job for me.  A nice office at NASA, a field that will probably never be verifiable.  Could you ask for more?

I was asked recently about the motivation for Abney’s DP (determiner phrase) hypothesis. That is, that determiners are not part of English noun phrases but head up their own phrases of which NPs are complements. I couldn’t remember the justification I was given in my Syntax I class, so I went back to the textbook (Syntax: A Generative Introduction by Andrew Carnie). I found the following interesting excerpt:

“… for lack of a better place to put them, we put determiners … in the specifiers of NPs. This, however, violates one of the basic principles underlying X-bar theory: All non-head material must be phrasal. Notice that this principle is a theoretical rather than an empirical requirement (i.e., it is motivated by the elegance of the theory and not by any data), but it is a nice idea from a mathematical point of view, and it would be good if we could show that it has some empirical basis.”

This clashes a bit with my empirical sensibilities. It represents very much the rationalist point of view in linguistics, that we can probe our own understanding of language by judging what we perceive to be grammatical or ungrammatical. The empiricist view would look at it from another angle: does it appear in data? So the theoretical view might be “nice” but if it is not supported by the data, it is crap.

Treebanks don’t use DPs (at least none that I’ve seen), so automatic parsers typically have no concept of them. I wonder if they would add any value? I’m guessing they would just run into sparsity issues since another set of tags has to be estimated. But who knows, the extra structure might be helpful in complex situations.
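
To make the structural difference concrete, here is a minimal sketch of the same phrase bracketed both ways. The nested-tuple tree format, the DT/NN labels, and the helper names `np_to_dp` and `count_labels` are all assumptions made for illustration, not the format of any actual treebank or parser. The label counts hint at the sparsity worry: the DP analysis adds a node type that a statistical parser would have to estimate.

```python
# Trees are (label, child, ...) tuples; leaves are plain strings.
# Flat treebank-style analysis: the determiner sits inside the NP.
np_tree = ("NP", ("DT", "the"), ("NN", "dog"))

def np_to_dp(tree):
    """Rewrite NP(DT ...) as DP(DT NP(...)), recursing through the tree."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    children = [np_to_dp(child) for child in children]
    first = children[0] if children else None
    if label == "NP" and isinstance(first, tuple) and first[0] == "DT":
        det, *rest = children
        # The determiner heads its own phrase and takes the NP as complement.
        return ("DP", det, ("NP", *rest))
    return (label, *children)

def count_labels(tree, counts=None):
    """Count how often each node label occurs in a tree."""
    counts = {} if counts is None else counts
    if isinstance(tree, str):
        return counts
    counts[tree[0]] = counts.get(tree[0], 0) + 1
    for child in tree[1:]:
        count_labels(child, counts)
    return counts

dp_tree = np_to_dp(np_tree)
print(dp_tree)
print(count_labels(np_tree))
print(count_labels(dp_tree))
```

Whether the extra layer helps or hurts a real parser is an empirical question this toy conversion cannot answer.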

While watching the 2000 version of Henry James’ The Golden Bowl, I heard the once-common phrase “The deuce only knows…”  I’m always looking for vintage profanity, and this appealed to me strongly.  I’ve heard it hundreds or thousands of times before, of course, but here it was brought to the fore of my attention.  After some brief research, I found ties to 16th Century Northern German, Family Guy, and playing dice.  The word deuce seems most strongly tied in meaning to “the devil,” and is used interchangeably in old-fashioned profanity (cf. What the devil and What the deuce).

There are attested uses of the phrase “Was der Daus!” in German from the 16th Century, which has my money for being the real origin of the phrase. Daus meant “devil” though the modern German is “Teufel.” Deuce also means “two” and comes from the French deux. Supposedly, the combination of the German phrase and the playing of dice led to the phrase entering English usage. Rolling two (the Devil’s eyes) inspired the curse, since that was the lowest score and therefore, a loss. I’m not sold on this particular coincidence. It seems too much like folk etymology of the sort you hear in email forwards. Lastly, while I enjoy Family Guy enormously when I hear it, I very seldom get the opportunity to watch an episode, so the tie to Stewie was lost on me until Google unearthed it.

And when OpenEphyra is given the question What is the origin of the word deuce? the answer is “Watkins.” It offers as evidence this page. That page poses the question What does the word deuce mean? but the answer has nothing to do with my information need. Also, the word Watkins never even appears on that page, so I have no idea where it came from.

In previous posts on cognate identification, I discussed the difference between strict and loose cognates. Loose cognates are words in two languages that have the same or similar written forms. I also described how approaches to cognate identification tend to differ based on whether the data being used is plain text or phonetic transcriptions. The type of data informs the methods. With plain text data, it is difficult to extract phonological information about the language so approaches in the past have largely been about string matching. I will discuss some of the approaches that have been taken below the jump.  In my next posting, when I get around to it, I will begin looking at some of the phonetic methods that have been applied to the task.

There is nothing unusual about verbing nouns in English.  Despite the fact that your English teacher may have told you not to do this, it is common practice, especially on the intarwebs.  Verbing brand names to mean the primary action performed by the chief product of that brand is less common, but we all know about “googling.”  Just sitting here, trying to drink my morning coffee, I couldn’t come up with another example.

But what got me thinking about this is another example used in today’s User Friendly.  One character says,

“You’re gonna ebay it to goths, aren’t you.” [emphasis mine]

I had never heard the brand name ebay used in verb form, meaning to sell something on ebay (the primary function of their chief product). It is not uncommon, though. Searching the Google for +“to ebay it”, I found that at least 10% of the top few pages of results were just this construction (versus “to ebay. It …”). I estimate from that there are about 19,000 uses of ebay as a verb in this context, and no doubt many others in variations (e.g. “I ebayed my watch”).
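
The arithmetic behind that kind of estimate is plain extrapolation from a sample. A minimal sketch, assuming the estimate came from multiplying the total hit count by the sampled proportion; the total hit count is not given above, so the figure below is back-derived from the two numbers that are, and the function name `estimate_uses` is made up for illustration.

```python
def estimate_uses(total_hits, sample_fraction):
    """Extrapolate from a sampled proportion to the full set of hits."""
    return total_hits * sample_fraction

# Figures from the post: a 10% sample proportion and an estimate of about 19,000.
sample_fraction = 0.10
estimated_verb_uses = 19_000

# Assumption: estimate = fraction * total hits, so the unstated total
# can be recovered by dividing the estimate by the fraction.
implied_total_hits = estimated_verb_uses / sample_fraction
print(round(implied_total_hits))
print(round(estimate_uses(implied_total_hits, sample_fraction)))
```

Search engine hit counts are rough at best, and the top few pages are not a random sample, so the result is a ballpark figure and nothing more.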

Another example that just occurred to me, but which is pretty artificial, is to twitter, meaning to post something on Twitter.  I say this is artificial because Twitter openly encourages and suggests this terminology.  It was not an emergent construct, but an imposed one.  It has been adopted by the overwhelming majority of users, though.  [follow me on twitter]

So here is my question: does this only work for Internet companies? I’m probably forgetting some obvious brick-and-mortar company for which we have verbed their brand, so please tell me if I have. Or is it that Internet companies are especially conducive to this construction because so many Internet companies start off with only one service and become known by that service? Google is search, ebay is selling crap through auctions, twitter is … twittering. If this only works for Internet companies, why did we start doing it in the first place?

And I just came up with a brick-and-mortar example:  hoover.  You can hoover down a plate of food, meaning to suck something up like a champ.  But my classification still holds, that is the primary function of their chief product (or at least the main product that people know them by).  Marketing people have already taken this to heart, I’m sure.  You need an easy name that sounds like English.  Just like with scientific terminology, no one wants to Dinklefwat their dishes.

The article I mentioned the other day concerning a computer program that confirms dogs communicate has drawn attention from Language Log [first here, more here]. The first was more of a rant from Geoff Pullum that left me feeling like he’s just not much of a dog person (or at the very least, has a healthy skepticism of animal communication claims).  Actually I think he is more angry with the way the media covers this sort of research, but I should stop now before putting too many uninformed words in his mouth.  Mark Liberman goes much more in depth and actually picks apart the paper by Molnar, Kaplan, Roy, Pachet, Pongracz, Doka and Miklosi (the Hungarian scientists mentioned in my previous post).

For anyone interested in machine learning and/or animal communication, I think the Liberman post is worth reading. A few highlights are as follows:

  • no tests were done to see if the computer was significantly more accurate than humans
  • computer accuracy overall was 43% while human accuracy was 40%
  • the article is less about communication than it is about the physiological state used to produce the barks: that is, if a dog is emotionally stimulated, body in a lunging position, his bark will naturally differ from a resting dog
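
The first two points go together: whether 43% really beats 40% depends on how many barks were classified. A minimal sketch of a two-proportion z-test, using the two accuracies quoted above and sample sizes that are purely hypothetical (the actual counts are not given here). It also assumes the two sets of judgments are independent, which may not hold if humans and the computer judged the same recordings.

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """Two-sided z-test for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# 0.43 and 0.40 are the accuracies quoted above; the sample sizes are hypothetical.
for n in (100, 1000, 10000):
    z, p = two_proportion_z(0.43, n, 0.40, n)
    print(f"n={n}: z={z:.2f}, p={p:.4f}")
```

With small samples a three-point gap is well within noise; only with very large samples would it count as a real difference.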

The first two points are important in that the pop science articles reporting the study misrepresented the impact of the research, which is not very surprising. The third point is more interesting to me, though I have never done anything with animal communication aside from learning about it briefly in an introductory linguistics class. I had heard about gorillas who could communicate with sign language, and assumed the results were provocative but not controversial. It was fascinating to learn that whether gorillas are doing anything more than memorizing a set of signs that lead to rewards is still debated.

I saw a video via StumbleUpon the other day where a chimp and a human are shown a screen with numbers that are flashed quickly then converted to blank squares. The task is to touch the squares in descending order. The chimps can do it amazingly fast and humans screw it up big time. I attribute this to the idea that animals are “present” or “in the moment,” while humans tend to have a lot going on in their heads that distracts them from the real world. The chimp reacts to the present world, and the humans get bungled up by trying to sort the spatial configuration of the screen as they see it.  They are slowed down by converting the scene into a mental representation, rather than just seeing what is in front of them.  But I’m just theorizing…  I hope someone with more experience with cognitive science can enlighten this off-the-cuff opinion.

Returning to Mark Liberman’s comments about the physiological condition of the dog, I have to partially disagree.  First of all, I definitely agree that the dog’s physical position allows the proper bark to be made.  For Daedalus to produce his beagle howl, his body must be rigid and his head extended upwards.  I have tried to move his body to prevent this bark (because it’s loud as hell on a quiet street at 12am) and have managed to distort it.  However, it is still clearly recognizable as this particular type of bark.  I don’t really see, though, why the territorial instinct of warning other animals away from his territory (wolf heritage) would demand the body rigid, head-back position.  I think the bark demands that position in order to be made, much as our mouth must be in different configurations to make different sounds (ignoring trained exceptions like ventriloquism).  That’s not to say the “fight” body position and the “fight” bark are not interrelated.  After all, a human must be holding a sword and facing massive opposition to yell:

This is SPARTAAAAAAAAAAAAAAAAAAAAAAAAA!!!!!!!

In my previous post on cognate identification, I gave two definitions for cognates: strict and loose (orthographic). Strict cognates are words in two related languages that descended from the same word in the ancestor language. Loose cognates are words in two languages that are spelled or pronounced similarly (depending on whether the data consists of plain text or phonetic transcriptions). These two definitions help form the basis for how I choose to classify approaches to doing cognate identification, but the source of data is the bigger factor, in my opinion. The orthographic approach looks at plain text and attempts to do some sort of string matching or statistical correlation based on the written (typeset) characters of the language. The phonetic approach relies on phonetic transcriptions of words in the language. Phonetic transcriptions are usually done in the International Phonetic Alphabet (IPA) but any standard form of representing sounds will work. One such example is the Carnegie Mellon Pronouncing Dictionary. Phonetic approaches may use string matching techniques, but there are also a number of inductive methods based on phonology that have been tried to good effect.
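
As a rough illustration of what “some sort of string matching” can look like on plain text, here is a minimal sketch that scores word pairs by normalized edit distance. This is one common string similarity measure chosen for the example, not necessarily the method of any particular paper, and the function names and word pairs are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Similarity in [0, 1]: one minus edit distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest

# English/German pairs (lowercased), plus one unrelated pair for contrast.
pairs = [("night", "nacht"), ("water", "wasser"), ("night", "water")]
for a, b in pairs:
    print(a, b, round(similarity(a, b), 2))
```

A loose-cognate detector along these lines would flag pairs scoring above some threshold, which would have to be tuned on real data.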

So a good question might be: why does the data being used matter so much to these techniques? Why not classify the two approaches as to whether they look for loose or strict cognates? Might there not be another way of classifying the approaches to cognate identification beyond these two? Or is there an entirely different set of classes that would better describe them? To answer the last two questions, I will say that there very well may be better ways of classifying these algorithms. As Anil pointed out in the comments to my last post, the two definitions lend themselves to different applications. From the papers that I read, it seemed that when researchers looked at plain text data, there was a completely different mindset than in papers where researchers used phonetic transcriptions. For the former, the goal was usually finding translational equivalences in bitext, and for the latter the goal was more to aid linguists attempting to reconstruct dead languages or establish relationships between languages.

With plain text, it is very difficult to infer sound correspondences between two languages. In Old English, the orthography developed by scribes corresponded directly to the spoken form. As English changed over the 1000+ years since then, the orthographic forms of words have frozen in some cases and not in others. For example, the word knight was originally spelled cniht and the c and h were both pronounced. The divergence of orthographic and phonetic forms can result in any number of problems and so it influences the ways of thinking about the task. On the other hand, phonetic approaches suffer due to data scarcity. Obtaining phonetic transcriptions is expensive as it requires the effort of linguists or individuals with specific, extensive training in the area. There are ways of obtaining phonetic transcriptions automatically, but these methods are not perfect and so result in noisy data, making this data practically useless for historical linguists.
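
The knight example can be put in rough numbers. A minimal sketch using the standard library’s `difflib.SequenceMatcher` to compare English knight with its German cognate Knecht, once by spelling and once by pronunciation. The IPA strings are my own approximate broad transcriptions and should be treated as assumptions, as should the choice of similarity measure.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Similarity in [0, 1] based on matching blocks of characters."""
    return SequenceMatcher(None, a, b).ratio()

# English "knight" vs. German "Knecht", lowercased.
spelling = ratio("knight", "knecht")

# Approximate modern pronunciations in IPA (assumed transcriptions).
phonetic = ratio("naɪt", "knɛçt")

print(f"spelling: {spelling:.2f}, phonetic: {phonetic:.2f}")
```

The frozen English spelling still carries the old k and gh, so the written forms look closer than the modern spoken forms do. The same pair gives a different picture depending on which kind of data you feed in.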

In my next post, I will go into orthographic approaches in more detail, describing some of the papers I looked at and the methods they used. After that, I will begin discussing phonetic approaches, which are more numerous. I will also begin to look at how machine learning is being used to tackle cognate identification.

Donna is watching Good Times right now on TVLand and because I can’t help but devote a certain portion of my brain to the TV whenever it’s on, I heard this juicy morsel:

Michael: Dad, you sure look nice.
James: Yeah, I’m clean as a chitterling.

And unfortunately, it’s been a few minutes, so maybe I’m misquoting a bit. Anyhow, I’d never heard that particular saying before. Chitterlings aren’t something you normally consider clean, so I hit the net hoping to find some info about it. I found the following:

  • “Ludacris lookin clean as a chitlin as XM throws him a birtday party…” [source]
  • “thankx 4 add n me clean as a chitlin huh? nice suit” [source]
  • “These slick Jewtown thieves could pick it clean as a chitlin’ in half an hour.” [source]
  • “I’m cleaner than a chitlin washed in Clorox.” [source]
  • “Nelly is cleaner than a chitlin.” [source]

So precious few examples to work from. It appears to be African-American in origin, since all of the examples are from AA sources. Of course, chitterlings are pork intestines ingested as food and apparently stink like a sonuva bitch. When I worked in a grocery store (in Greenville, SC), we sold gallon buckets of the stuff. When preparing them, you have to clean them before cooking. To be honest, the very thought is making me sick. So now we have origin and speaker group. Problem solved?

If anyone knows of any other occurrences or where the phrase came from, please let me know (in the comments).

Semantics and Pragmatics is a new open-access, peer-reviewed journal focusing on the semantics and pragmatics of natural language.  While the journal isn’t focused directly on computational methods, they expect to also publish material relevant to philosophers, psychologists and computer scientists.  Hopefully it will be more than the occasional submission that is of interest to CS people.  I’m looking forward to the first issue.

A couple months ago, I wrote about Richard Hogg dying. He was a professor at the University of Manchester who edited the Cambridge History of the English Language and did a lot of work on Old English morphology. I had corresponded with him briefly a few months before he died about a lab project on computational morphology. I was making a morphological analyzer for Old English verbs. I’m actually still working on it and generalizing it to the rest of the language. Anyhow, as I said before, he was a nice and helpful guy and it was a shame to see him go.

Now, the International Society for the Linguistics of English (ISLE) has set up a scholarship in his honor. Early career scholars who are members of ISLE (membership can be applied for at the time of submission) are eligible. Early career means you either haven’t gotten your PhD yet or got it within the past two years. Masters and undergraduate applicants are acceptable, but the expected entrant is a PhD candidate/recent recipient. The paper may be on any research-related topic in English or English linguistics and will be judged on originality and the contribution of its results. The prize is £500 and the submission deadline is March 31, 2008.

The things Bush says are so awesome sometimes.  My new favorite quote is “Childrens do learn when standards are high and results are measured.” [source] At first the White House transcriptionists corrected the mistake, but then press secretary Dana Perino instructed them to include the mistake, saying that the integrity of the transcriptions is very important to her.  This is good.

Language Log brought this particular juicy quote to my attention and Mark Liberman has an interesting commentary on the nature of the grammatical mistake, one more common to children than adults. He also has a clip you can listen to. He goes on to say that Bush does pause after he says childrens but that there’s no indication he’s just made a planning mistake. I’m not completely sure I agree here. I don’t think he necessarily did, but it’s possible. I’m curious whether he was reading from a teleprompter or a piece of paper and misread it as children’s. Then, seeing the rest of the quote, he may have paused because it didn’t parse at first and plunged on ahead, because he’s a public speaker and it’s better to just keep going than to stop and visibly appear lost.

Anyhow, the interesting part of Liberman’s post is the reference to chilluns, which he attributes to some possibly fictional southern dialect.  It’s not fictional.  It’s called Gullah and it’s from around the Charleston area in South Carolina.  Interestingly, I have also heard some people use a similar form in the country around the midlands of South Carolina.  I’m not really sure how to transcribe it, but it’s sort of like chillren.  Unlike chilluns it’s usually not plural (at least not that I recall).  When I first heard it, I thought the speaker was joking and using covert nonsense speech, like many of the words my wife and I use together.   For example, Kek kek kek = Connecticut, Pennsyltucky = Pennsylvania (especially when referring to the more rural parts), and South Kakalakee = South Carolina.  (We didn’t make all those up, but they are parts of our private conversations.)

But you can actually find a lot of occurrences of chillren on Google, so it’s not all that uncommon.   It seems to appear in a lot of slave narratives (judging by the Google results), so it probably had its origins in the pre-Civil War era and has survived in some areas.

Language Log brought up the usage of the phrase another thing coming today. This is the only way I’ve ever heard it or seen it used. But it turns out, the original is another think coming. The thing version is winning out on the interwebs, but the post on Language Log indicates that the two phrases may have been warring since their (mutual?) inceptions. It’s no surprise to me that thing would replace think in this case, for simple phonological reasons. The [k] in think is preceded by a voiced nasal sound (the vocal cords are vibrating) and then followed by an unvoiced velar stop (aka plosive, but essentially another [k] sound). The phenomenon of assimilation occurs when a phoneme changes to reflect the surrounding phoneme(s). In this case, the [k] probably originally became voiced, which would make it a [g] sound. The [k] and [g] sounds are essentially the same; the only difference is whether your vocal cords are vibrating. So, assimilation generated thing instead of think in regular speech and since that is a well-known word, people interpreted it as thing instead of think when they were first exposed to it. From there it has been gaining steam.
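
The claim that [k] and [g] differ only in voicing can be spelled out with a small feature table. A minimal sketch, assuming a drastically simplified three-feature description of each sound; the feature inventory and the function names `differing_features` and `voice` are made up for illustration, and the code only models the voicing change as described above, not the full story of how the phrase changed.

```python
# Simplified articulatory features; this three-feature inventory is an assumption.
FEATURES = {
    "k": {"place": "velar", "manner": "stop", "voiced": False},
    "g": {"place": "velar", "manner": "stop", "voiced": True},
    "ŋ": {"place": "velar", "manner": "nasal", "voiced": True},
}

def differing_features(a, b):
    """Return the names of the features on which two sounds differ."""
    return sorted(name for name in FEATURES[a]
                  if FEATURES[a][name] != FEATURES[b][name])

def voice(sound):
    """Return the sound with the same place and manner but voiced, if one is listed."""
    target = dict(FEATURES[sound], voiced=True)
    for symbol, feats in FEATURES.items():
        if feats == target:
            return symbol
    return sound

print(differing_features("k", "g"))
print(voice("k"))
```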

Another interesting example of a similar nature is hone in on versus the original home in on.

This past spring I worked on a morphological analyzer for Old English verbs. To my knowledge, this has never been done using finite state transducers. As part of my search to find the current state of the art for this language, I emailed Professor Richard Hogg at the University of Manchester. He wrote the section of the Cambridge History of the English Language on Old English morphology. A lot of times, you’ll email a professor and it could take days for them to get back to you, especially if they are at a different university. Sometimes they don’t respond at all. But, Dr. Hogg was a very polite and helpful guy, saying my work sounded interesting and pointing me to the Stella group at the University of Glasgow. His section on morphology in the Cambridge History was also very helpful, so I felt quite grateful to the guy. I wish I could have known him better.

Read his extensive obituary in the Guardian.

Linguistic Issues in Language Technology (LiLT) is a new open-access journal in computational linguistics. The journal will focus on techniques that bring linguistics back into language technologies (LT). LT currently focuses a lot on statistical techniques and sometimes ignores linguistic insight altogether, but the field is beginning to swing around from the purely statistical approach to one that takes linguistic insight into account and merges it with statistical methods.

Curious about what sort of credibility this journal would have, I browsed the editorial staff and found some pretty big hitters. Following are some of the names that stood out to me. Christopher Manning of Stanford wrote the textbook used in my Language and Statistics class. Kemal Oflazer was one of my previous professors, who was visiting CMU last year. He’s done a lot of work with finite state transducers for morphological analysis of Turkish, among other things. Mark Liberman and Aravind Joshi of the University of Pennsylvania are pretty well known and accomplished. Aravind Joshi came up with Tree Adjoining Grammar and both he and Martin Kay won the ACL Lifetime Achievement Award. Mark Steedman is the current president of the ACL (Association for Computational Linguistics). Jason Eisner has done a lot of work on applying statistics to linguistic approaches and advised one of my current professors, Noah Smith. Philip Resnik has done a lot with word alignment and statistical machine translation.

Dr. Amit Almor of my alma mater and his team have used fMRI images of the brain to see just what is going on when people use pronouns versus proper nouns. They found that the spatial area of the brain lights up when proper nouns are used, implying that the brain builds a new representation each time. When pronouns were used, these areas did not light up.

So representations for proper nouns are merged as processing goes on. However, too many of them can result in disruption in the processing of new input. Interestingly, a similar phenomenon occurs in users of American Sign Language. Signers point to a space in the air as a reference point for proper nouns, a function similar to pronoun usage in spoken language. Pretty cool.

Enormity is a word that has gone through the wringer in the past few decades. Originally it meant a monstrous offense or excessive wickedness (American Heritage). However, its similarity to the word enormous has caused it to be used by an ever growing number of people to mean immense size. As with all things language, attempting to turn back the natural tide of almighty usage is futile. For example, the title of an article just posted on the National Geographic website is “Angkor’s Ancient Enormity Uncovered”. I was disappointed when the story wasn’t about a mass sacrifice or other such atrocity.

Wired has a story on the fact that almost all the mice used in laboratory research today are descended from a few inbred mice about a hundred years ago. It seems like there are advantages to having inbred mice in terms of experimental control, which may have been part of the original motivation. The fewer factors that change from experiment to experiment, the better you can isolate/attribute causality. But of course, the criticism here is that the lack of diversity in test-mouse genetics may be the reason that problems with certain drugs didn’t become apparent until after the drugs hit the market.

All this interesting stuff about mice aside, we are then given a quote by the illustrious Fernando Pardo-Manuel de Villena:

“To make an analogy between mice and humans, using the classical inbred strains is like doing studies on 10 people selected from one small town in Appalachia.”

Now here is an interesting word: howmsoever. The Language Log has a few examples of its usage and a couple of theories as to how it came about. The first is that the archaic word howsomever, which was an alternate form for however, changed through metathesis to howmsoever. I’m more inclined to the theory that it is derived by analogy from whomsoever/whosoever. I can imagine a scenario where someone wants to drag out however to make what is about to be said more significant (or some such scenario). It seems like howmsoever would fill that need nicely. But who knows.

About Me

Jason M. Adams

My name is Jason M. Adams and I recently graduated with my masters from the Language Technologies Institute at Carnegie Mellon University. My main areas of research were with recommender systems and word sense disambiguation. Now I am on the job market. And I am obsessed with my two dogs.

Site Information

Contact me: jaso...@gmail.com

Creative Commons License

This work by Jason M. Adams is licensed under a Creative Commons Attribution 3.0 License.

Header image credit seakwenby.
