Article: Part of Speech Tagging and the Ego Cell
By Michael Rice, Aikernel Project Administrator
Part of speech taggers have been with us for a long time. A tagger is a piece of software that takes a string of natural language (like "my car is red") and tags each word, or maybe a small grouping of words, with its associated part of speech.
You remember your parts of speech, don't you? In America, we were mostly taught eight parts of speech based on the work of a pioneer named Thrax. Thrax lived a long time ago, but his work lives on and is well known to most school kids around here, and perhaps in most other English-speaking countries as well. He basically divided the words of a sentence according to the roles they play in that sentence. Consider the following sentence:
John runs quickly.
Each of the three words in this example has a different part of speech: "John" is a noun, "runs" is a verb, and "quickly" is an adverb.
So we could rewrite this sentence with its parts of speech, as taggers often do, as:
John/NOUN runs/VERB quickly/ADVERB
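If it helps to see the word/TAG notation in code, here is a minimal sketch in Python. It is only an illustration of the notation; the function names are made up for this example and are not part of the Aikernel.

```python
def format_tagged(pairs):
    """Join (word, tag) pairs into the word/TAG notation."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)


def parse_tagged(text):
    """Split word/TAG notation back into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the last slash, so a word that itself
        # contains a slash is still separated from its tag correctly.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged = [("John", "NOUN"), ("runs", "VERB"), ("quickly", "ADVERB")]
line = format_tagged(tagged)
print(line)
```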
There are probably two things you are asking yourself right now:
(By the way, I'm just guessing what your questions are! Send me an e-mail to ask me a specific question and we'll update this article.)
Let's consider your first question: why is this a whole subject in and of itself? Well, because human speech is, and probably always will be, highly ambiguous. Our goal is to create a piece of software that turns it into an unambiguous statement. One important part of accomplishing this goal is to understand how different words are being used in different statements. This gets us one step closer to our goal, and one step closer to acting like human intelligence, by giving us important clues about what the user is trying to accomplish.
In the past, there were really two different approaches taken to accomplish this goal. The first is a statistical (also known as stochastic) method. The second is a rule-based approach (probably more like what you learned in school). Both methods have been around almost as long as people have tried to make machines intelligent. Generally, the stochastic approach has been more successful than the rule-based approach; in fact, some taggers have achieved over 96% accuracy. This is probably because the rules that humans use can be as ambiguous to a machine as the language we are trying to parse! However, in recent years the two schools of thought have started to come together, yielding some highly successful taggers. The technique developed by Brill (who now works for Microsoft Research) is a good example of this.
Neither approach is prohibitively difficult to implement. In addition to its accuracy, the statistical approach has the advantage of sparing developers from having to write too much content (in the form of rules and language sets, also known as a corpus), and it can also be applied to other languages with relatively little pain.
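To give a feel for the statistical idea, here is a minimal sketch of the simplest stochastic tagger I can think of: count how often each word carries each tag in a hand-tagged corpus, then always pick the most frequent tag. The toy corpus, the tag names, and the function names are all assumptions made up for this illustration; real stochastic taggers also look at the surrounding words, and this is not the Aikernel implementation.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences):
    """Learn the most frequent tag for each word in a tagged corpus."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    # Keep only the single most common tag per word.
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_with_model(model, words, default="NOUN"):
    """Tag each word with its learned tag; unseen words get the default."""
    return [(word, model.get(word.lower(), default)) for word in words]


# A tiny, made-up training corpus. "run" appears once as a noun
# and twice as a verb, so the model learns VERB for it.
TOY_CORPUS = [
    [("the", "DET"), ("run", "NOUN"), ("was", "VERB"), ("long", "ADJ")],
    [("dogs", "NOUN"), ("run", "VERB"), ("fast", "ADVERB")],
    [("they", "PRON"), ("run", "VERB"), ("daily", "ADVERB")],
]

model = train_unigram_tagger(TOY_CORPUS)
result = tag_with_model(model, ["Dogs", "run", "quickly"])
print(result)
```

Notice that "quickly" never appears in the toy corpus, so it falls back to the default tag and comes out wrong. That is the kind of gap that only more data and more tuning will close.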
However, the downside is that stochastic taggers do require a large amount of opaque statistical code and can take a lot of effort to refine. If you'll follow me for just a moment as I divert from our main topic, you'll notice that operating systems are much the same. They are relatively easy to get off the ground, but require an immense amount of usage in diverse configurations to perfect and stabilize. I believe this is a perfect task for a bunch of open source enthusiasts... and I think it is also why the Linux teams have been able to create such a robust operating system.
Because of that initial up-front complexity, I've implemented a very simple rule-based approach to tagging. By doing this (as opposed to the stochastic approach), I can accomplish the following:
The tagger I have implemented very quickly is not even as good as those designed in the 1960s. One of the first successful rule-based taggers was a dual-stage tagger. This is how it worked:
In this first version, the system simply looks up the part of speech that each word belongs to and assigns it to that word. In this way, our first rough cut at an open source tagger is a single-stage implementation.
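A single-stage tagger of this kind can be sketched in a few lines of Python. Treat this as a rough illustration under my own assumptions, not as the Aikernel code: the lexicon contents, the UNKNOWN fallback, and the function name are all hypothetical.

```python
import re

# A tiny hand-built lexicon: each word maps to exactly one part of speech.
# A real lexicon would be far larger; these entries are made up.
LEXICON = {
    "john": "NOUN",
    "car": "NOUN",
    "my": "PRONOUN",
    "is": "VERB",
    "red": "ADJECTIVE",
    "runs": "VERB",
    "quickly": "ADVERB",
}


def tag_single_stage(sentence, lexicon=LEXICON, unknown="UNKNOWN"):
    """Look each word up in the lexicon and assign whatever tag is found."""
    # Pull out runs of letters (and apostrophes), dropping punctuation.
    words = re.findall(r"[A-Za-z']+", sentence)
    return [(word, lexicon.get(word.lower(), unknown)) for word in words]


print(tag_single_stage("John runs quickly."))
print(tag_single_stage("My zebra is red"))
```

Because each word has only one entry, a lookup like this cannot tell "a run" from "I run". It never looks at the neighboring words, which is exactly what keeps it a single-stage design.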
So we've got our work cut out for us. What good does it do us?
If you recall, in the Activator / Context model you can assign a certain word, such as "car", as a FACT to be consumed by your cell when it gets an ActivationEvent. With the part of speech API introduced as a member of the aik.logic.nlp.tagger package, you can also receive the same Activator as a NOUN (or, if you specify a different tagset, such as the Penn Treebank, you might get an NN; more on this later). Why the dual effort?
Well, because if you want to receive the Activator "car" as part of the ActivationEvent, you would have to define "car", or a thesaurus lookup, or a morphologically similar word (such as the plural "cars"). So, let's say you wanted to create a cell that lets people buy stuff for their car. You could define the word "buy" as an ACTION activator and then define all the different types of things you can buy, such as cars, radiators, mufflers, and the like. If the user says "I want to buy a muffler", you will receive "buy" as an ACTION activator and "muffler" as a FACT. This is powerful information because you can figure out what the user wanted to do. But it also assumes that you have a good idea of the whole set of things that a user would want to do.
Let's consider a similar, but different, example. Let's say you want to create a cell that lets users buy anything. This would be tough because you'd have to define all FACT activators from armadillos to the U2 Zooropa CD (OK, that was my lame attempt at coming up with an A to Z list without picking the obvious Zebra as a "Z"!). This is where the part of speech tagging system becomes so powerful. Now, let's say you defined "buy" and its thesaurus alternatives as your ACTION and you want to receive whatever noun there is as a FACT. So if someone says "I want to buy that armadillo" you will get two activators as part of your ActivationEvent, one for the ACTION "buy" and one for the NOUN "armadillo". You can also take this example further. Perhaps you are interested in the demonstrative "that" in "I want to buy that armadillo". This might differentiate "that" armadillo from "the other" armadillo. You can ask for this too. You see, to computer linguists and taggers, there are more parts of speech than just Thrax's original eight. In the aik.logic.nlp.tagger package you'll find some classes with static members. These correspond to the different parts of speech defined throughout the years. The simplest is the Penn Treebank tagset, which defines 46 parts of speech. Support for the C5 and C7 tagsets is coming. The C7 tagset has 146 different parts of speech defined. More or less nuance may be more or less important to you depending on how you define your cell.
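Here is a rough Python sketch of that idea: a short list of ACTION words is declared up front, and everything else is picked up purely by its part of speech. This is not the Aikernel API. The function name, the tag names, and the pre-tagged input are assumptions I made up to illustrate the selection logic.

```python
# "buy" plus a hand-picked alternative, standing in for a thesaurus lookup.
ACTION_WORDS = {"buy", "purchase"}


def collect_activators(tagged_words, action_words=ACTION_WORDS,
                       wanted_tags=("NOUN",)):
    """Return (kind, word) pairs for declared actions and wanted tags."""
    activators = []
    for word, tag in tagged_words:
        lowered = word.lower()
        if lowered in action_words:
            # Explicitly declared words are reported as ACTION activators.
            activators.append(("ACTION", lowered))
        elif tag in wanted_tags:
            # Anything else is accepted purely by its part of speech.
            activators.append((tag, lowered))
    return activators


# Assume a tagger has already produced this output for
# "I want to buy that armadillo".
tagged = [
    ("I", "PRONOUN"), ("want", "VERB"), ("to", "PREPOSITION"),
    ("buy", "VERB"), ("that", "DETERMINER"), ("armadillo", "NOUN"),
]

print(collect_activators(tagged))
print(collect_activators(tagged, wanted_tags=("NOUN", "DETERMINER")))
```

The second call shows how asking for one more tag brings "that" along as well, without declaring a single extra word.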
Moving right along, this is a good time to talk about the Ego Cell. While it is still in its infancy, I've developed this cell mostly to showcase the capabilities of the part of speech tagging, but also to let the Aikernel chat with the user about its own features and functions. For example, "Do you support multitasking?". So perhaps we will get the ActivationEvent with "support" as our ACTION and "multitasking" as a NOUN.
Ah ha! So, you say to yourself, "but I still need to teach the system what the word 'multitasking' means!" And you are, of course, absolutely right. This is why I delayed implementing part of speech tagging in the first place, because it only gets you part of the way to where you need to be. The most important thing your cell can do is understand the word "multitasking", no matter what part of speech it is. Part of speech tagging really only minimizes the number of activator declarations you need to make.
Perhaps in the future, we can think of a way to categorize or define knowledge so that even the cell doesn't need to know in advance what a word like "multitasking" means, but for now, this is the state of affairs.
We've still got a lot of challenges ahead with this technology, but at the end of the day, I hope that you'll find this to be a valuable addition to the Aikernel. I have to confess that I didn't intend to include this feature in this release (v1.3.3) without the full implementation of a statistical/stochastic tagger, but then it occurred to me that we could do a partial implementation and we would benefit from taking the API "out for a walk" before we burn too many programmer cycles on the actual tagger implementation.