You are currently browsing the monthly archive for December, 2007.

Among the several online comics I read on a regular basis is the Saturday Bulletin.  It’s a great comic that takes an old-timey picture and reinterprets the content.  So today’s comic is of Santa tromping off over a hill in the snow, saying “mmm… venison.”  Blitzen isn’t feeling so good about it.

The thing that struck me was that it referred to the reindeer “Donner” as “Donder”.  Donner is the German word for thunder and Blitzen is lightning.  I knew this from my time learning German, so it made complete sense, and seeing “Donder” struck me as a mistake.  I have heard children say “Donder” before, so I thought this was a production mistake bleeding over into adulthood.  Lo and behold, Donder is the older version of Donner’s name (Donder being an older spelling of the German word).  Originally the pair of reindeer had the Dutch names Dunder and Bliksem.  Check out the Wikipedia entry and the Donder home page, an amusing campaign to restore the name of Donder.

I found this in a bathroom of a Taco Bell near Sutton, West Virginia. Not only is it misspelt, but there is something wonky with the logic here. Why should I be courteous and lock the door? Isn’t the social norm that people knock before entering a bathroom? Granted, they would assume it’s a multiperson bathroom in this case, but a sign could be put on the door: “NOK FRSIT.” Also, doesn’t it seem to assume that you won’t mind much if someone barges in on you wiping yourself, but that the barging party will? A better sign would remind you that you need to lock that door.

Be Courtious - Bathroom sign

While helping my mother with some pictures, I found a stash of old pictures of Daedalus from when he was a puppy. These were his “toddler” days. He was just insanely cute, wasn’t he? Nellie is the Italian Greyhound and belongs to my middle sister.

Daedalus as a toddler #1

My lemon beagle Daedalus as a toddler 2

My lemon beagle Daedalus as a toddler 3

One of the things my mother got me was a nice, little digital camera. I didn’t want something super-high quality that would cost a fortune, but I wanted something good, small, and that I wouldn’t have a heart attack over if I broke it. I wanted something that would slip in my pocket that I can (hopefully) use for photowalking. My cell phone has been what I’ve been using lately, but while it’s decent for a phone, it’s just not that great. So my new 8 MP Kodak EasyShare M883 will do the trick. I took some pictures of a bouquet of roses to play around with it.  To continue the lightweight theme, I used Picasa 2 from Google for the editing. The end product is kinda cool I think.

Baby’s breath and roses

I also got Battlestar Galactica: Razor and Scene It? 2nd Edition as well as a few other things. Good times.

Courtesy of Chaotic Utopia:

I think this should be required reading for any novice programmer and probably even more so for established programmers. Agree with him or not, I think you’ll agree that Steve Yegge has some interesting things to say. My favorite quote:

“Bigger is just something you have to live with in Java. Growth is a fact of life. Java is like a variant of the game of Tetris in which none of the pieces can fill gaps created by the other pieces, so all you can do is pile them up endlessly.”

This is especially interesting to me as I just jumped on the IDE bandwagon.  I received a few interesting comments on that post that are worth reading.  A minor theme was the fact that you just can’t handle a massive code base without some kind of IDE (Integrated Development Environment).  I have worked with a code base of about 20,000 lines of Java with no IDE and there were certainly challenges.  I have also worked with a code base of over 100k lines of C (not ++) and that was a pain in the butt.  Massive changes took me days to complete and then weeks to debug.  Having an IDE would have made it easier, but it also would have made it much larger.  It is so easy to bloat up code with every kind of get/set method and constructor there is, but many of them are never used.  Is that a bad thing or just good future planning?  There is definitely a trade-off, and one that probably comes down on the side of bad thing more often than not.

In any case, it’s something I have to keep in mind as I go forward with my new project.

I am a fan of good beer. In this post I am going to talk about my ideas for how to improve websites that offer ratings for different varieties of beer, and how I think recommender systems would improve their service.

Why I care

Whenever I’m asked what kind of beer I like, I experience a moment of awkwardness. Because I don’t just like good beer, I hate bad mediocre beer. Usually the person asking is a beer noob and I don’t want to sound too snobby by throwing beer jargon at them they probably won’t understand. So I say something along the lines of “I like the more expensive stuff, like from small breweries or imports.” The response is usually something about Sam Adams. I used to enjoy Sam Adams Boston Lager, but I can barely stomach it anymore. There are a couple Sam Adams brews that aren’t half bad, but the Boston Lager no longer cuts it for me.


Polar bears depend on the Arctic ocean habitat for their survival and things are looking pretty grim for them with global warming, toxins in the food chain, and disturbances due to oil and gas drilling. The US Geological Survey has said that they could be extinct by 2050 if we do not take steps to protect them now.

If you’re a US citizen and you give a crap, I urge you to send a message to Secretary of the Interior Dirk Kempthorne to have the polar bear added to the endangered species list and some lands designated as a critical habitat necessary for the bears’ survival.

I just saw a post on Statistical Modeling dealing with some of the worst use of statistical graphics this year. Be sure to check it out. I’d have to say I agree with that assessment. The case deals with two pictures of a road during the Crimean War. In the first picture, there is a road covered in cannonballs. In the second, the road is clear. Errol Morris challenged his readers to figure out which picture came first. The correct answer is the clear road.

Morris uses pie charts and bar graphs to display the reasons people gave for their decisions. While colorful, these graphs are also meaningless. So given the data, I z-normalized the “on” choices and the “off” choices (made it so their distributions had mean 0 and standard deviation 1). I used the same bar graph setup (except horizontal this time). Since I normalized each distribution, the actual quantity of voters one way or the other no longer really makes a difference. I am just comparing the relative preference by one side or the other for a given reason. This assumes that there is some significance to a person not choosing a particular reason, which may be incorrect.
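For anyone who wants to try the same trick, here is a minimal sketch of z-normalization in Python. The vote counts are made up purely for illustration (they are not Morris's numbers), and the names `z_normalize`, `on_first` and `off_first` are my own placeholders.

```python
import statistics

def z_normalize(values):
    """Rescale values so they have mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot z-normalize a constant series")
    return [(v - mean) / std for v in values]

# Reasons people gave for their choice.
reasons = [
    "shadows",
    "shelling",
    "characteristics/artistic",
    "number and position of balls",
    "practical concerns",
]

# HYPOTHETICAL counts of voters citing each reason, split by which
# photo they thought came first. Not the real data.
on_first = [30, 8, 10, 22, 15]
off_first = [12, 25, 24, 20, 14]

on_z = z_normalize(on_first)
off_z = z_normalize(off_first)

# After normalizing, the raw number of voters on each side drops out and
# only the relative preference for each reason is compared.
for reason, a, b in zip(reasons, on_z, off_z):
    print(f"{reason:30s} on={a:+.2f}  off={b:+.2f}  diff={a - b:+.2f}")
```

A large positive or negative difference for a reason suggests that one side leaned on it much more than the other, which is all my chart is really trying to show.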

Click to enlarge the graph if it’s not properly visible:

Errol Morris discusses data on people’s decisions about two photographs from the Crimean War.

So what I think my chart shows is that shadows are the worst feature to choose for correctly guessing which came first. People who focused on either the shelling or characteristics/artistic features were more likely to choose correctly.  The most confusing feature is the number and position of the balls.   Also confusing were practical concerns.  If I were going to train a support vector machine to classify images of this type, I would use the three features: shelling, characteristics/artistic and shadows.

So what do you think? Am I way off on trying to normalize these and make this kind of assessment? I am, after all, a statistics amateur.

I knew what eminent domain is, but I didn’t know what cities are using it for now.  How many made-for-TV movies, comedies, and kids’ movies have there been where the hero/heroine has to stop the evil developers from moving in and destroying the quaint hometown/small mom-and-pop store/diner/wildlife preserve?  There were probably more than a dozen episodes of Scooby-Doo that used this theme.  Who knew instead of dressing up like a ghost, they could have gotten Old Man Parker to move out just by appealing to the corruptibility of city officials?  Who needs local flavor when you can have clonetown and lots of tax dollars?

These cities are the new Sheriffs of Nottingham:  steal from the little guy to give to the development conglomerate.

When you subscribe to a crapload of feeds that have overlapping subject matter, you see interesting themes emerge. In the astronomy subblogosphere, the recent news about the double galaxy 3c321 has sparked yet another competition over who can come up with the coolest headline. In case you haven’t heard about it, 3c321 consists of two galaxies, one of which is shooting a jet of particles at the other (via its black hole) which could strip the atmosphere off any planets in that galaxy. Here are the headlines I have collected in the wild:

  1. Bad Astronomy: Taste my death ray, 3c321!
  2. Space.com: Galaxy blasts neighbor with deadly jet
  3. NASA: ‘Death Star’ galaxy black hole fires at neighboring galaxy
  4. NASA Image of the Day: Black Hole Bully
  5. Discovery News: Galaxy zapping neighbor with deadly beam
  6. National Geographic: ‘Death Star’ galaxy found blasting smaller neighbor
  7. Celebritycraps: Black Hole ‘Owns’ Galaxy!!!
  8. Cumbrian Sky: ‘Death Star’ galaxy lets rip…
  9. BBC: Black Hole ‘bully’ blasts galaxy
  10. ArsGeek: I swear, some peoples galaxies…

And the list goes on with variations on the theme. Almost as shocking as the campy puns are the multitude of posts that just regurgitate titles from the major news outlets.

Now witness the firepower of this fully armed and operational battlestation!

This phenomenon is not limited to the domestic abuse in double galaxy 3c321. I have observed it occur again and again. I suppose it comes from probably three different causes: catchy headlines attract readers, blogs are supposed to be creative outlets and so bloggers try to be creative (and I guess newspaper editors as well), and a natural desire by people to show off their wit. I decided to combine all three by going over the top and using a fake word just to make it rhyme. The result attracts readers, is creative, shows off my prodigious wit, and thumbs a cynical nose at the blagoblag for its absurdity while ironically increasing said absurdity. Insert arrogant, fake, British-gentleman laugh here.

As a side note, wouldn’t it be interesting if we’re actually witnessing a galactic war between two ridiculously advanced civilizations who don’t mind taking millions of years to kill each other?

Well, after many frustrating months of waiting for Twitter to finally fix their gmail contacts import feature, I have finally done it!  Surprise, only two contacts were signed up — and that’s two more than I expected.  However, one of those is a professor who probably only checked them out because they’re using his technology and the other was a friend who had only one update:

“nothing.”

Social pressure from me caused him to add another update.  That’s what I tell myself anyway.

What is Twitter, you ask?  It’s basically Facebook status updates made global.  Indeed, you can even add a Facebook app that allows Twitter to update your status.  Of course, it means you get “is twittering: ” inserted at the beginning of any tweet (a single Twitter status update) as your status update.

While Twitter at first seems like status updates on steroids, it’s actually evolving into something else far more useful.  I’ve talked before about the information diaspora and the difficulty of keeping up with all your personal information as it flies around the web.  Twitter at first adds to that mess, but it does offer interesting ways of tracking small bits of information.

Erin McKean, the Dictionary Evangelist, uses it to keep track of new words she comes across.  Twitter lets you text updates from your cell phone or IM client so it’s easy to update on the go.  Robert Scoble uses it as a sort of mini-blog of things he comes across or finds out about that wouldn’t really make a full-fledged blog post.  So Twitter has uses for logging your web surfing, hobby, life activities, etc., which is a useful information diaspora reducing measure in my book.  The only question remains whether this would be of any use to you.

Check me out and follow my updates on Twitter.  If you haven’t signed up, consider it.  If you do, let me know so I can follow you.

Wright Flyer I

The Wright Brothers had their first successful flight on December 17, 1903. The flight lasted for 12 lousy seconds, but a machine that was heavier than air that they had built stayed under control off the ground. Whether they were actually the first or if their flight was even long enough to be valid, it is undeniable that they have had a massive impact on aviation and the world. They didn’t have to use space age polymers or special blends of fuel. Just ingenuity and hard work. They fabricated a gasoline engine in their bicycle shop and built the body out of a spruce tree. Pretty cool.

It has always struck me as amusing just how worked up people can get over the right to say “first in flight.” I guess because it was such a monumental achievement at the time, we have lost sight of the wonder that must have accompanied it back then. Man had conquered the sky! Of course, this feeling didn’t emerge until a few years later since the Wright Brothers were generally considered to be hoaxsters around the world until a demonstration in France in 1908.

SR-71 Blackbird

I had a friend from New Zealand who claimed a countryman had been the first to fly. Ohio and North Carolina both try to take credit for the Wright Brothers. The first plane was developed in Dayton, Ohio, but actually flown in Kitty Hawk, North Carolina. I’ve been to Wright-Patterson Air Force Base, which is now home of the National Museum of the Air Force. One of the coolest things I saw there was a decommissioned SR-71 Blackbird, a supersonic spy plane. We weren’t supposed to touch it, but I wanted to so much (I was like 12 or 13) and my uncle just said to go ahead and do it. The worst that could happen is we’d be kicked out, right? But nothing happened, and it felt like slightly cool, smooth metal. It was great.

After the Wright brothers patented their invention, there were years of disputes over patent violations. Continued claims of being the first in flight sprang up. After their first little hop on this day in 1903, they maintained a high level of secrecy so that government-backed researchers, who were the big players in the game, couldn’t steal their ideas. They were just two guys who ran a bike shop and wanted to make money on it. I think that was the right thing to do. If they hadn’t been secret, they would have been squashed like bugs by the big guys. I think the real legacy of the Wright brothers should not be the controversy or the secrecy, but the fact that two dudes in a bike shop, tinkering with wood and engines, changed the world in 12 seconds.

Boredom and insomnia led to the following:

I Am A: Neutral Good Human Wizard (4th Level)

Ability Scores:

Strength-11
Dexterity-13
Constitution-14
Intelligence-19
Wisdom-12
Charisma-12

Alignment:
Neutral Good A neutral good character does the best that a good person can do. He is devoted to helping others. He works with kings and magistrates but does not feel beholden to them. Neutral good is the best alignment you can be because it means doing what is good without bias for or against order. However, neutral good can be a dangerous alignment because it advances mediocrity by limiting the actions of the truly capable.

Race:
Humans are the most adaptable of the common races. Short generations and a penchant for migration and conquest have made them physically diverse as well. Humans are often unorthodox in their dress, sporting unusual hairstyles, fanciful clothes, tattoos, and the like.

Class:
Wizards are arcane spellcasters who depend on intensive study to create their magic. To wizards, magic is not a talent but a difficult, rewarding art. When they are prepared for battle, wizards can use their spells to devastating effect. When caught by surprise, they are vulnerable. The wizard’s strength is her spells, everything else is secondary. She learns new spells as she experiments and grows in experience, and she can also learn them from other wizards. In addition, over time a wizard learns to manipulate her spells so they go farther, work better, or are improved in some other way. A wizard can call a familiar- a small, magical, animal companion that serves her. With a high Intelligence, wizards are capable of casting very high levels of spells.

Find out What Kind of Dungeons and Dragons Character Would You Be?, courtesy of Easydamus (e-mail)

At a meeting at school last week, we discussed several ideas about just what made games fun. There were a variety of topics, from the definition of a game (versus a puzzle, a challenge, etc) to gender differences in game appeal. What interested me the most is the actual theory about what it is in a game or puzzle that translates to the experience of fun in the human brain. Many of the ideas we discussed came from the book A Theory of Fun for Game Design by Raph Koster. I am just beginning to think about these things, so the ideas expressed below are sketches and are still evolving (actually, I believe every idea should be either evolving or evolvable).

One of Koster’s main points is that the human brain is a pattern matching machine. We find patterns everywhere, be they sequences of moves in chess or where to aim in first-person shooters. When the patterns are too complex or are just random (essentially the same thing from an information theoretic standpoint), the brain grows bored or annoyed. When the patterns are too simple and we discover them all (as in Tic-Tac-Toe), that is also boring. Koster says the following:

“If I were Will Wright, I’d say that ‘Fun is the process of discovering areas in a possibility space.’”

My advisor made the point that there is a certain zone you get into during a game where everything else goes away and you are totally focused. This can also happen with an Excel spreadsheet or coding (or knitting or playing sports). So we seek the zone. I think the zone is a state of mind where your brain is attuned to certain patterns and you are able to find new, related patterns quickly. The zone is a feedback loop that arises from the successful discovery of certain initial patterns that allows you to solve increasingly complex patterns.

When I used to play Halo, the zone was the most awesome experience of all. I would charge a base, kill all the guys between me and the flag, grab the flag, rush back out and zoom back to my home base to score. It was addictive (so addictive I had to get rid of Halo).  So how do you make a game that encourages finding the zone?  Koster’s book has some tips, which I won’t go into, but it’s worth checking out.

It snowed yesterday enough to cover the ground, but freezing rain turned to rain overnight and it was all gone by morning. It’s back to flurrying today and there was one stretch that was particularly beautiful as the snow was whipped around by the wind. It could have been a blizzard if it lasted for more than 10 minutes.

Snowing

That picture came out better than I expected, actually. I’ve had difficulty in the past getting snow to show up on film and be anything more than a gray fuzz in the air. I used flash with the “action shot” setting on my camera. Speaking of cameras, I need a new one. This one is going on four years old. And I guess film is the wrong word there since this is a digital camera, so what would I call it? Showing up on silicon?

Willow looking at the snow

Willow in the Window

Google Reader recently added some social networking features. You can now add your friends’ shared items to your feeds. Up till now I haven’t used the shared items feature since it didn’t really make sense to send people a link to my shared items and expect them to give a crap. Now it’s easier to subscribe to it and the decision to give a crap is left up to them when they read their feeds.

As Robert Scoble pointed out, there is one major flaw with the new feature: it clutters up the rest of your feeds. This is, of course, assuming you read your feeds in the “All feeds” folder. I usually don’t, since I have a number of Tech News feeds that I’m not always interested in and that often have duplicated information. I only read them when I have time. You can do the same with your friends’ shared items, so no big deal to me.

So why not check out my shared items?  As it happens, I have none at the moment.  But if you use Google Reader, feel free to add/invite me so we can view each other’s.  I’ll accept any invitations.

I have talked about dictionaries in the past, so you might know that I have a certain fascination with them. One of the best things about the interwebs is the ability to access information about practically anything in a very short time. If someone mentions some sort of literary reference in a chat, one quick jump to Wikipedia and I can instantly be up to speed. Or if someone uses a word I can’t remember or don’t know the definition of, I can pop over to dictionary.com and quickly discover the missing piece of information.

But just how quickly? I timed the following process for five different words:

  1. open a new tab in Firefox
  2. enter dictionary.reference.com in the address bar (the address autocompletes, so I only type di, down arrow, enter).
  3. enter the word and wait for the definition

This takes about 7 seconds per word. Part of the slowness is the fact that there are about a bazillion ads on dictionary.com. Sometimes I start typing but not all of the ads have finished loading so the javascript hasn’t put the focus in the word box. The result is that half the word is missing when the focus finally goes in there and I have to start over. In those cases, I expect the average time jumps up to more like 10-12 seconds. This is also annoying.

Enter ninjawords.com. Average time per lookup using the above method is 3 seconds. There are no ads. The instant step 2 is done and I start typing, the text box has focus and in under a second after hitting enter the definition is displayed. Beautiful. Plus, I can separate multiple words by commas and get more than one definition at a time, saving me from repeating steps 1 and 2.

Unlike dictionary.com, ninjawords uses Wiktionary. So no longer do I have the research potential of seeing the Indo-European roots of words and there is always the potential for vandalism to seep in and corrupt a particular definition, though that has a low probability. If I need definitions with authority, I can always resort to dictionary.com. If I need them with speed for use in fast-paced settings (like in the middle of an IM session), I can use ninjawords.

Greg Linden and Daniel Lemire have both written a little about the Netflix Prize and whether the systems that are doing the best are really worth anything. The KorBell system that recently won the Progress Prize consists of 107 different parts in an ensemble system (Note: the team of Bob Bell and Yehuda Koren at AT&T goes by BellKor and KorBell on the Netflix leaderboard). The paper is interesting for two reasons: the ensemble method being used and the fact that only about 3 or 4 of those components are doing the heavy lifting. Actually, I have no idea whether the actual ensemble algorithm they use would be especially interesting to anyone else, but as I have no experience with ensembles in this context, it was interesting to me.
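Since I'm new to ensembles, here is the simplest version of the idea I could sketch for myself: blend the predictions of several components with a weighted sum and compare the error. This is only an illustration under my own assumptions, not the KorBell method; the ratings, the three "models" and the weights are all invented, and `blend` and `rmse` are just helper names I made up.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length lists of ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(actual))

def blend(component_predictions, weights):
    """Weighted sum of the components' predictions, one value per rating."""
    n_ratings = len(component_predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, component_predictions))
        for i in range(n_ratings)
    ]

# HYPOTHETICAL true ratings and component predictions.
actual = [4.0, 3.0, 5.0, 2.0]
components = [
    [3.5, 3.5, 4.5, 2.5],  # made-up model A
    [4.5, 2.5, 4.5, 1.5],  # made-up model B
    [3.0, 3.0, 3.0, 3.0],  # weak baseline that always predicts 3
]

# Hand-picked weights for illustration; a real system would fit these
# on held-out data. A weight of 0 means the component contributes nothing.
weights = [0.5, 0.5, 0.0]

blended = blend(components, weights)

for name, preds in zip(["A", "B", "baseline"], components):
    print(f"model {name}: RMSE = {rmse(preds, actual):.3f}")
print(f"blend: RMSE = {rmse(blended, actual):.3f}")
```

In this toy setup the two decent models make errors in different places, so averaging them beats either one alone, while the baseline gets no weight at all. That is at least consistent with the observation that a handful of components can do most of the heavy lifting.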


D’Arcy Norman has a great post on the Creative Commons license and why to choose only CC:By rather than some of the stricter licenses. This blog has been using CC:By since I began it, so it was nice to see someone with similar sensibilities putting out a much clearer statement of why it’s a good idea than I could.

People are free to reuse anything I write on this blog or the pictures I take, may God help them, so long as they attribute what they have done to me.  So far the only ones who have used my stuff are spam blogs that mysteriously republish just about everything I write, often attributing it to some other author’s name but posting a link back to my site. Very bizarre.

Japan is hunting whales again apparently. Sign the petition and maybe save some of these poor creatures.  The website apparently is hard on browsers so save any open work.

[via dpn]

Reuters is reporting on a Russian website, CyberLover.ru, which has made a chatbot that supposedly can romance women in chat rooms into giving up their digits. Unfortunately, the whole thing is in Russian, so I can’t evaluate it. I am really curious what sort of output this thing gives. I think it also points out just how desperate people in chat rooms are to connect with someone. Assuming, of course, it actually works.

Whenever I’m in the lab and mistype my password logging into my laptop, there is an insanely loud beep from the PC speaker. Why not use the actual speakers on the machine rather than resorting to the PC speaker, a relic from the times when computers and dinosaurs walked side-by-side and computers had to be loud in order to be heard over the rumbling of the earth? Tonight I was messing around on the command line in MySQL and entered a bad command only to have my ears blown away by this 270 decibel dinosaur-alerting screech.

So I went searching for a solution to my problem and I was willing to do anything — even if it meant opening my system and ripping out the little speaker’s still-beeping heart. I gotta hand it to Microsoft, though, they make things easy. Psyche!

Under Control Panel > System > Hardware > Device Manager, you get a screen like so:

device manager

You would think that the PC Speaker would be under “Sound, video and game controllers”, but you’d be wrong. PC Speaker is hidden under System Devices. Disabling that does absolutely nothing. This is because Microsoft practices something called function obfuscation. Basically, if you expect something to do something because doing so would be intuitive, the actual function is performed by something else.

The Microsoft developers had this conversation:

Bob: Ok, we need to add the PC Speaker to the Device Manager.
Jim: I think we should add it to “Display Adapters” since it is displaying sound in the air.
Bob: Good point.
Jill: Wait, that is really messed up. People might guess that.
Bob: I just had an idea. People might guess that.
Jill: That’s what I said.
Bob: Be quiet, Jill, men are talking.
Jill: <storms out of the room>
Jim: I know, let’s make it a hidden option called Beep.
Bob: Brilliant. It’ll be years before anyone finds it.

To make a long story about a really boring topic that just totally pissed me off so I had to vent short:

Under View, choose the option “Show hidden devices.” This will reveal the “Non-plug and play devices” node in the tree under which is the “Beep” device. Click on the Driver tab and click “Stop” and under Startup choose the type as “Disabled”. Now wasn’t that easy?

The dog park over winter has no operating water fountains, so we either have to bring in water or find it.  Willow prefers to find it.  Granted, she was really hot (and thirsty) after our first round of frisbee.  School is just about over (minus one final) so I’ve been working from home for the past couple days.  The result is that my dog obsession is peaking again.

Willow in a puddle

I wrote about Predictify a while back. It’s basically a website that pays users for predicting world events. When I first wrote about them, I presented the sample question: “How long will Michael Vick’s sentence be?” Well, the verdict came down and my prediction was very close. I predicted 24 months and the dirty bastard got 23. Total payout for me: $6.07. Not bad for 30 seconds’ effort.

The site appears to be doing well. Currently, there are 24 open polls with large cash payout potential. I was pretty skeptical it would succeed (and that still has yet to play out fully), but it would seem that the guys who predicted its success are going to be looking at nice payouts of their own.

To my dog Daedalus, comfort is being close to Master. He often forsakes comfortable beds for our laps. He crawled up on the chair today while I was on the computer and plopped down.

Daedalus on my lap

Because resting his head on a laptop is more comfortable than his plush doggy bed. This position got annoying for me very quickly, so I moved him to the other chair and made him a little bed.

Daedalus on the chair

Up until recently, I was pretty old school with how I write my code. Vim, baby. No code completion, only syntax highlighting. I had used a couple IDEs back when I first started taking classes, but just found them cumbersome. For my software engineering class this semester and for my project next semester, I will have to use Eclipse. This gave me my first opportunity to use Subversion in a team environment as well. I must say, I was seriously missing out. Eclipse has been really fun to use and being able to check in code and keep track of changes has been invaluable. Now I’m subversioning all of my side projects. Overkill? Well, this way I can instantly sync between my laptop and my school Linux machine. I use the school machine to run experiments since it’s running four 3 GHz processors whereas my laptop has only two 1.8’s.

Anyway, last night I decided to check out the new version of the NetBeans IDE (6.0). I had used it very briefly in the past and found it to be a slow resource hog. It’s still a hog, but runs fine on my laptop with 2 GB of RAM. Plus the new features are pretty awesome. Not only does it have code completion, but also code suggestions and instant generation of get/set methods for your class variables. Also, like Eclipse, it has Subversion support built-in.

Which IDE do you use? I’m admitting here and now to being a noob to the IDE world, so is there something better?

In my previous post on cognate identification, I gave two definitions for cognates: strict and loose (orthographic). Strict cognates are words in two related languages that descended from the same word in the ancestor language. Loose cognates are words in two languages that are spelled or pronounced similarly (depending on whether the data consists of phonetic transcriptions or plain text). These two definitions help form the basis for how I choose to classify approaches to doing cognate identification, but the source of data is the bigger factor, in my opinion. The orthographic approach looks at plain text and attempts to do some sort of string matching or statistical correlation based on the written (typeset) characters of the language. The phonetic approach relies on phonetic transcriptions of words in the language. Phonetic transcriptions are usually done in the International Phonetic Alphabet (IPA) but any standard form of representing sounds will work. One such example is the Carnegie Mellon Pronouncing Dictionary. Phonetic approaches may use string matching techniques, but there are also a number of inductive methods based on phonology that have been tried to good effect.
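To make the string matching idea concrete, here is a minimal sketch of one possible orthographic measure: edit distance normalized by the length of the longer word. I am using it only as an illustration, not as the method of any particular paper; the word pairs and the 0.5 cutoff are arbitrary choices of mine, and `edit_distance` and `similarity` are placeholder names.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Score in [0, 1]; 1.0 means the lowercased spellings are identical."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest

# Illustrative English/German pairs; the cutoff is an arbitrary assumption.
THRESHOLD = 0.5
pairs = [("night", "Nacht"), ("water", "Wasser"), ("dog", "Hund")]

for english, german in pairs:
    score = similarity(english, german)
    label = "loose cognate?" if score >= THRESHOLD else "probably not"
    print(f"{english:6s} {german:7s} {score:.2f}  {label}")
```

A measure like this only sees letters, so it can find loose cognates at best. It knows nothing about shared ancestry, which is exactly why the strict and loose definitions come apart.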

So a good question might be: why does the data being used matter so much to these techniques? Why not classify the two approaches as to whether they look for loose or strict cognates? Might there not be another way of classifying the approaches to cognate identification beyond these two? Or is there an entirely different set of classes that would better describe them? To answer the last two questions, I will say that there very well may be better ways of classifying these algorithms. As Anil pointed out in the comments to my last post, the two definitions lend themselves to different applications. From the papers that I read, it seemed that when researchers looked at plain text data, there was a completely different mindset than in papers where researchers used phonetic transcriptions. For the former, the goal was usually finding translational equivalences in bitext, and for the latter the goal was more to aid linguists attempting to reconstruct dead languages or establish relationships between languages.

With plain text, it is very difficult to infer sound correspondences between two languages. In Old English, the orthography developed by scribes corresponded directly to the spoken form. As English changed over the 1000+ years since then, the orthographic forms of words have frozen in some cases and not in others. For example, the word knight was originally spelled cniht and the c and h were both pronounced. The divergence of orthographic and phonetic forms can result in any number of problems and so it influences the ways of thinking about the task. On the other hand, phonetic approaches suffer due to data scarcity. Obtaining phonetic transcriptions is expensive as it requires the effort of linguists or individuals with specific, extensive training in the area. There are ways of obtaining phonetic transcriptions automatically, but these methods are not perfect and so result in noisy data, making this data practically useless for historical linguists.

In my next post, I will go into orthographic approaches in more detail, describing some of the papers I looked at and the methods they used. After that, I will begin discussing phonetic approaches, which are more numerous. I will also begin to look at how machine learning is being used to tackle cognate identification.

View all posts on cognate identification.

I’m going to officially coin the term information diaspora to mean the dispersion of individual personal preference information throughout the web. Whenever you sign up for an account, you leave a part of your personal information somewhere. Whenever you enter an address to order a book, more information. When you look through Digg comments and you thumbs-up or thumbs-down a comment, more information. Whenever you favorite a video on YouTube, leave a wall post on Facebook, rate a movie on Netflix, more information. All of this information is accessible to you as long as you can recall where you have left it. If you forget about a website you signed up for, that information is now missing. It’s not dead or gone, just missing.

Your brain is no longer the homeland of all these orphaned data. Social networking is great, but with the current Web 2.0 bubble expanding the way it is, the inherent incompatibility in the global network is becoming more and more a problem.


Donna decorated the apartment a couple days ago and it is looking quite festive. I personally love it. I think she managed to incorporate our current colors and furniture quite well and the tree is beautiful. I’m definitely in the Christmas spirit now. :)

Christmas tree 1

Christmas tree 2

Christmas tree 3

Take the geek test and find out. Just a note of warning — the test is not short. I am a super geek, coming in at 45.16575%.

i am a super geek

If you do happen to take it, don’t forget to leave me a comment with your score.

One of the dark horses of the inner solar system will soon make its closest approach to Earth since it was discovered in 1983.  Phaethon is an asteroid (perhaps the burnt out core of a comet).  We pass through its debris trail every December, resulting in the Geminid meteor shower.  This year, the Geminids will peak on December 13-14th.  Bonus:  the Geminids are likely to be even better than the Perseids this year.  Unfortunately, it’s cold out.  Plus I have an exam on the 14th.  This meteor shower didn’t get the memo I sent out that it had to fall on a weekend.

So what’s special about the Geminids?  Phaethon is a source of denser meteors than are found in most other meteor showers.  This results in meteor paths that can be jagged and more meteors that break apart and split.  According to Space.com, the Geminids have a history of slow, bright meteors and faint meteors, but few medium-brightness ones.  The moon will be a faint crescent and peak times will see 60-120 meteors per hour.

For more on the discussion of whether Phaethon is a burnt out comet or an asteroid, check out Astroprof’s page on the topic.  If you happened to download Celestia when I talked about it before, you can also download an add-on that includes a few thousand near-Earth objects.  Phaethon is included in that pack (it doesn’t come with Celestia by default, or at least I couldn’t find it).  That site (the Celestia Motherlode) has a number of very awesome additions to Celestia, so I recommend checking it out.

Why is the US focused on Iran so much right now? I say we focus on the real threat: Greenland. That’s right — Greenland.

Why nuke the poor peace-loving people of Greenland, you might ask. They are not doing anything per se. But they are sitting on a mountain of fresh water locked in their glaciers. And in those glaciers lies the key to temporarily halting global warming.

Nuke Greenland

So we drop a multi-megaton hydrogen bomb over Greenland and detonate it in the air. The heat wave will vaporize some of the ice, but the temperatures will be so hot for miles that much of it will melt (a hydrogen bomb reaches temperatures in excess of 10 million degrees Celsius at the burst point). The melting ice will flow into the ocean as runoff and then proceed to cool down the North Atlantic Deep Water current. The same current was cooled down about 8200 years ago when Lake Agassiz (a giant North American glacial lake 7 times bigger than all the Great Lakes combined) melted and drained into the North Atlantic. That melting event spurred a mini-ice age according to new research.

What about the people of Greenland? Do they deserve to die to cool down the northern hemisphere? Well, there are only 56,000 inhabitants, so we can easily relocate them to more sensible locations like the coastlines of the US. Once this plan is announced, miles of beach-front property will go on the market and will be easily purchased for next to nothing.

As an added benefit, monsoon seasons will be much lighter throughout Asia. All that pesky rain previously used for growing crops and the mosquitoes that pass on malaria will be significantly lessened. This will save thousands of people who might otherwise have succumbed to malaria.

Look, Greenland is melting anyway. Do our children deserve to wait decades for it to slowly melt before they get relief from the awful effects of global warming? NO! Nuke Greenland now for our children. Those glaciers and the bears that live on them are the real terrorists.

I recently finished a literature review for my Language & Statistics 2 class. The topic was computational models of historical linguistics and my partner and I focused on cognate identification and phylogenetic inference. We split the work and my part was cognate identification. So I decided to blog about it for a bit and maybe someone out there will have something to offer. Granted, that won’t help my grade, but improving my understanding is more important. You can also check out our presentation.

First of all, to frame the problem, historical linguistics is a branch of linguistics that studies language change. Language can change in many ways, but the methods we looked at pretty much solely focused on phonological and semantic changes, with a few brief nods to syntactic change (on the phylogenetic inference side). The main tool used by historical linguists in reconstructing dead languages is the comparative method. This method looks at two languages suspected of being related and tries to infer the regular sound changes that led to the divergence. By examining lists of suspected cognates, they find sound correspondences — sounds that appear in similar contexts in both languages, but which aren’t necessarily the same phoneme. For example, the word for beaver in English and German derives from the Proto-Germanic word *bebru. In Old English, this became beofor (the f sounds like a /v/). In modern German, the word is Biber, with the /b/ phoneme preserved as it was in Proto-Germanic. So we could infer a sound correspondence between English /v/ and German /b/ in this context.
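As a toy illustration of what finding sound correspondences means computationally, the sketch below counts which segments line up across already-aligned word pairs. The transcriptions are rough and hand-aligned by me, the seven/sieben pair is my own addition alongside the beaver/Biber example, and the function name is hypothetical. Real systems have to discover the alignment themselves, which is the hard part.

```python
from collections import Counter

# (English segments, German segments), aligned position by position.
# Rough, simplified transcriptions for illustration only.
aligned_pairs = [
    (["b", "iː", "v", "ər"], ["b", "iː", "b", "ɐ"]),            # beaver / Biber
    (["s", "ɛ", "v", "ə", "n"], ["z", "iː", "b", "ə", "n"]),    # seven / sieben
]


def count_correspondences(pairs):
    """Count how often each (English, German) segment pair lines up."""
    counts = Counter()
    for english, german in pairs:
        if len(english) != len(german):
            raise ValueError("word pairs must be pre-aligned to equal length")
        counts.update(zip(english, german))
    return counts


counts = count_correspondences(aligned_pairs)

# Keep only pairs of *different* sounds that show up more than once.
recurring = {pair: n for pair, n in counts.items()
             if pair[0] != pair[1] and n > 1}
print(recurring)  # {('v', 'b'): 2}
```

With only two word pairs this proves nothing, of course. The point is just that a correspondence like English /v/ to German /b/ is something you can count, and with enough cognate pairs the regular ones stand out from the noise.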

So what are cognates? If you have studied a second language, you no doubt have heard this term. I propose the following two classifications for cognates. A loose cognate will be a pair of words in two languages that are spelled or pronounced the same, with some minor variations. In this way, French résumé and English resumé would be considered cognates. Loose cognates have also been called orthographic cognates. A strict cognate is a pair of words in two related languages that descended from the same word in the ancestor language. Loan words are words that come into a language directly from another language, such as resumé. These words do not undergo the regular sound changes that are observed in strict cognates and so they are not considered cognates at all by historical linguists.

What is the effect the distinction between these two definitions would have on computational approaches to this task? I will look at this further in a future post, but feel free to post your thoughts in the comments.

The PISA (Program for International Student Assessment) test is administered to 15-year-olds in industrialized countries every three years. The 2006 results were just released and show that US students are ranked 17th out of 30 in science and 24th in math. About 1.3% of students reached the highest level on the test overall with New Zealand and Finland having the most star pupils at 3.9%. [source (Note: may require free registration)]


The snow continues here.  Today it was covering the road.  When I took Daedalus and Willow out in the morning, neither of them was having any of it.  Daedalus balked at the door and Willow was stepping gingerly and obviously wondering what had gone wrong with the world.  They finally got used to it, though the poor boy was shivering his butt off after a short while.  Willow was more in her element.  We’re gonna have to get him paw gloves.

Here is the scene out my office window.  They are currently building the new Computer Science Complex here.  One of the buildings is the Gates Center.  You can actually see shots from the live webcam 24/7, though the show is quite boring after about 4:30pm or so these days.  What’s amazing to me is that people are out there working right now.  In South Carolina, construction work ended as soon as the sky darkened and rain fell.  If snow fell, it would be like the end of the world had come.  I don’t think this makes them any faster, though.  Another bizarre difference between construction crews here and there is that there are no Hispanic people here.  This is a very bad thing as it also means there’s crap for Mexican food.  You could find good Mexican every time you turned around in SC.

Computer Science Complex

I discovered the java.util.Properties class a couple weeks ago in the ginormous Java API docs. If you’ve ever created a software project where you have a lot of different settings that change frequently, this is the class for you. In my research, I implement all these different algorithms for various things, find out they don’t work, implement something else, rinse, repeat. Being able to look back at my results from two months ago, load the exact same configuration, and run the experiment all over again is a must. Enter the Properties class.


From the most excellent xkcd:

I felt the exact same way when I first picked up Python.  It was like finding the holy grail of programming languages.  To be able to just throw things into a list and access them without having to worry about casting.  To throw around functions like they were variables.  To weave functions out of thin air and watch them vanish when their usefulness had expired.  It was magic.
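For anyone who hasn’t had the pleasure, here is a small sketch of those three things. All of the names (`grab_bag`, `apply_twice`, `make_adder`) are invented for the example.

```python
# A list happily holds an int, a string, a float, and even a function.
grab_bag = [42, "spam", 3.14, len]

# No casting needed on the way out: each item is still what it was.
number, word, ratio, measure = grab_bag
print(number + 1, word.upper(), measure(word))  # 43 SPAM 4


def apply_twice(func, value):
    """Functions get passed around like any other variable."""
    return func(func(value))


def make_adder(n):
    """Weave a brand-new function out of thin air."""
    return lambda x: x + n


add_three = make_adder(3)
result = apply_twice(add_three, 10)
print(result)  # 16

# A throwaway function that vanishes as soon as sorted() is done with it.
by_length = sorted(["ccc", "a", "bb"], key=lambda s: len(s))
print(by_length)  # ['a', 'bb', 'ccc']
```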

Of course, the honeymoon faded.  I still use Python as a first resort.  As a programming language for exploring new ideas, it can’t be beaten.  Development time is ridiculously fast.  There has been effort to get the runtime up to snuff as well, but with much reluctance I’m forced to admit it doesn’t compare to C or even Java, may God have mercy on my soul.  Granted, it all depends on the application, blah blah blah.

Despite all that, I still love it.  It’s definitely first in my heart as far as programming languages go.

Giuliani and Clinton are both faltering as the early primaries approach.  According to CNN, Giuliani is down 9 points while Clinton is down 11.  Huckleberry Hound is in second place for the Republicans, leading the has-been McCain, the ridiculous Thompson, and the irrelevant hypocrite Romney.  Meanwhile, Obomba is a distant second for the Sheepocrats with Edwards trailing at an even more distant third.  And of course, the principled candidate Kucinich is not even mentioned.  The poll in question has an error of +/- 5%, so the sample was pretty small.

The first (somewhat) serious snow came today. Well, not particularly serious, but compared to South Carolina where I’ve spent most of the last 22 or so winters, it would’ve shut down schools for two days at least. Daedalus acted a little prissy. He didn’t want to get his paws cold I guess. He got used to it quickly and went exploring. Willow had no problem with it. She loves the cold. The wind was gusting and that was more than the Bug could handle. I wish I could’ve taken a picture of his face. Normally getting him to come in is a hassle, but a quick “let’s go” and he was running for the door.

My lemon beagle Daedalus in the snow

My Australian shepherd Willow in the snow

I mentioned the esoteric programming language brainfuck a little while back. It consists of 8 operations and was created in order to make the smallest compiler in the world (I think the current best is 174 bytes). I was reading a post over on Good Math, Bad Math that defines arithmetic in terms of sets. Pretty basic if you’ve done anything with set theory, but Mark has a clear way of explaining things so I usually try to read all of his posts. I’ve been playing catch-up today.  It struck me immediately how closely the set form that Mark describes matches the syntax/logical structure of brainfuck.  So I decided to play around a little.  Read on for more.
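As a quick refresher on those 8 operations, here is a minimal interpreter sketch. It assumes the usual conventions, which do vary between implementations: a 30,000-cell tape, 8-bit cells that wrap around, and a zero stored when input runs out. The function name is made up for the example.

```python
def run_brainfuck(program: str, input_text: str = "") -> str:
    """Interpret a brainfuck program and return whatever it prints."""
    # Pre-match the brackets so loops can jump in one step.
    jumps, stack = {}, []
    for pos, op in enumerate(program):
        if op == "[":
            stack.append(pos)
        elif op == "]":
            if not stack:
                raise ValueError("unmatched ]")
            start = stack.pop()
            jumps[start], jumps[pos] = pos, start
    if stack:
        raise ValueError("unmatched [")

    tape = [0] * 30000        # assumed tape size
    ptr = 0                   # data pointer (no bounds checking here)
    pc = 0                    # program counter
    inputs = iter(input_text)
    output = []

    while pc < len(program):
        op = program[pc]
        if op == ">":                       # move right
            ptr += 1
        elif op == "<":                     # move left
            ptr -= 1
        elif op == "+":                     # increment cell (wraps at 256)
            tape[ptr] = (tape[ptr] + 1) % 256
        elif op == "-":                     # decrement cell (wraps at 0)
            tape[ptr] = (tape[ptr] - 1) % 256
        elif op == ".":                     # output cell as a character
            output.append(chr(tape[ptr]))
        elif op == ",":                     # read one character (0 on EOF)
            tape[ptr] = ord(next(inputs, "\0")) % 256
        elif op == "[" and tape[ptr] == 0:  # skip loop if cell is zero
            pc = jumps[pc]
        elif op == "]" and tape[ptr] != 0:  # repeat loop if cell is nonzero
            pc = jumps[pc]
        pc += 1                             # any other character is a comment
    return "".join(output)


# 8 * 8 = 64, plus 1 = 65, which is "A".
print(run_brainfuck("++++++++[>++++++++<-]>+."))  # A
```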


Donna just had me dig up a pic of our Christmas tree 2 years ago, so I figured why not post it. I think it was beautiful. That was the Christmas we got Daedalus.  Be sure to scroll down to see the little boy.

Christmas tree from our house in Irmo, SC

My lemon beagle Daedalus as a puppy.  Still the cutest dog ever.

Donna is watching Good Times right now on TVLand and because I can’t help but devote a certain portion of my brain to the TV whenever it’s on, I heard this juicy morsel:

Michael: Dad, you sure look nice.
James: Yeah, I’m clean as a chitterling.

And unfortunately, it’s been a few minutes, so maybe I’m misquoting a bit. Anyhow, I’d never heard that particular saying before. Chitterlings aren’t something you normally consider clean, so I hit the net hoping to find some info about it. I found the following:

  • “Ludacris lookin clean as a chitlin as XM throws him a birtday party…” [source]
  • “thankx 4 add n me clean as a chitlin huh? nice suit” [source]
  • “These slick Jewtown thieves could pick it clean as a chitlin’ in half an hour.” [source]
  • “I’m cleaner than a chitlin washed in Clorox.” [source]
  • “Nelly is cleaner than a chitlin.” [source]

So precious few examples to work from. It appears to be African-American in origin, since all of the examples are from AA sources. Of course, chitterlings are pork intestines ingested as food and apparently stink like a sonuva bitch. When I worked in a grocery store (in Greenville, SC), we sold gallon buckets of the stuff. When preparing them, you have to clean them before cooking. To be honest, the very thought is making me sick. So now we have origin and speaker group. Problem solved?

If anyone knows of any other occurrences or where the phrase came from, please let me know (in the comments).

When I was a