Posts Tagged ‘cognate identification’

In previous posts on cognate identification, I discussed the difference between strict and loose cognates. Loose cognates are words in two languages that have the same or similar written forms. I also described how approaches to cognate identification tend to differ based on whether the data being used is plain text or phonetic transcriptions. The type of data informs the methods. With plain text data, it is difficult to extract phonological information about the language so approaches in the past have largely been about string matching. I will discuss some of the approaches that have been taken below the jump.  In my next posting, when I get around to it, I will begin looking at some of the phonetic methods that have been applied to the task. (more…)

In my previous post on cognate identification, I gave two definitions for cognates: strict and loose (orthographic). Strict cognates are words in two related languages that descended from the same word in the ancestor language. Loose cognates are words in two languages that are spelled or pronounced similarly (depending on the data consists of phonetic transcriptions or plain text). These two definitions help form the basis for how I choose to classify approaches to doing cognate identification, but the source of data is the bigger factor, in my opinion. The orthographic approach looks at plain text and attempts to do some sort of string matching or statistical correlation based on the written (typeset) characters of the language. The phonetic approach relies on phonetic transcriptions of words in the language. Phonetic transcriptions are usually done in the International Phonetic Alphabet (IPA) but any standard form of representing sounds will work. One such example is the Carnegie Mellon Pronouncing Dictionary. Phonetic approaches may use string matching techniques, but there are also a number of inductive methods based on phonology that have been tried to good effect.

So a good question might be why does the data being used matter so much to these techniques? Why not classify the two approaches as to whether they look for loose or strict cognates? Might there not be another way of classifying the approaches to cognate identification beyond these two? Or is there an entirely different set of classes that would better describe them? To answer the last two questions, I will say that there very well may be better ways of classifying these algorithms. As Anil pointed out in the comments to my last post, the two definitions lend themselves to different applications. From the papers that I read, it seemed that when researchers looked at plain text data, there was a completely different mindset than in papers where researchers used phonetic transcriptions. For the former, the goal was usually finding translational equivalences in bitext and for the latter the goal is more as an aid to linguists attempting to reconstruct dead languages or establish relationships between languages.

With plain text, it is very difficult to infer sound correspondences between two languages. In Old English, the orthography developed by scribes corresponded directly to the spoken form. As English changed over the 1000+ years since then, the orthographic forms of words have frozen in some cases and not in others. For example, the word knight was originally spelled cniht and the c and h were both pronounced. The divergence of orthographic and phonetic forms can result in any number of problems and so it influences the ways of thinking about the task. On the other hand, phonetic approaches suffer due to data scarcity. Obtaining phonetic transcriptions is expensive as it requires the effort of linguists or individuals with specific, extensive training in the area. There are ways of obtaining phonetic transcriptions automatically, but these methods are not perfect and so result in noisy data, making this data practically useless for historical linguists.

In my next post, I will go into orthographic approaches in more detail, describing some of the papers I looked at and the methods they used. After that, I will begin discussing phonetic approaches, which are more numerous. I will also begin to look at how machine learning is being used to tackle cognate identification.

View all posts on cognate identification.

I recently finished a literature review for my Language & Statistics 2 class. The topic was computational models of historical linguistics and my partner and I focused on cognate identification and phylogenetic inference. We split the work and my part was cognate identification. So I decided to blog about it for a bit and maybe someone out there will have something to offer. Granted, that won’t help my grade, but improving my understanding is more important. You can also check out our presentation.

First of all, to frame the problem, historical linguistics is a branch of linguistics that studies language change. Language can change in many ways, but the methods we looked at pretty much solely focused on phonological and semantic changes, with a few brief nods to syntactic change (on the phylogenetic inference side). The main tool used by historical linguists in reconstructing dead languages is the comparative method. This method looks at two languages suspected of being related and tries to infer the regular sound changes that led to the divergence. By examining lists of suspected cognates, they find sound correspondences — sounds that appear in similar contexts in both languages, but which aren’t necessarily the same phoneme. For example, the word for beaver in English and German derives from the Proto-Germanic word *bebru. In Old English, this became beofor (the f sounds like a /v/). In modern German, the word is Biber, with the /b/ phoneme preserved as it was in Proto-Germanic. So we could infer a sound correspondence between English /v/ and German /b/ in this context.

So what are cognates? If you have studied a second language, you no doubt have heard this term. I propose the following two classifications for cognates. A loose cognate will be a pair of words in two languages that is spelled or pronounced the same, with some minor variations. In this way, French resumé and English resumé would be considered cognates. Loose cognates have also been called orthographic cognates. A strict cognate is a pair of words in two related languages that descended from the same word in the ancestor language. Loan words are words that come into a language directly from another language, such as resumé. These words do not undergo the regular sound changes that are observed in strict cognates and so they are not considered cognates at all by historical linguists.

What is the effect the distinction between these two definitions would have on computational approaches to this task? I will look at this further in a future post, but feel free to post your thoughts in the comments.