Saturday, February 10, 2007

Graph Library Suggestions

I have an idea for an algorithm that requires a graph. Any suggestions for a good C++ library for the data structure would be nice, so that I don't have to reinvent the wheel. There is one special requirement: there can be (but needn't be) two edges per pair of connected nodes. Each edge has a direction and a weight. If the edge is implemented as a template class, then it can store a tuple or pair with the differing weights for each direction. I've taken a look at Boost, but it seems like overkill. Below is an example of what I need.



I created the image above in Kivio. This is the first time I've had an opportunity to use it. Great work, KOffice team!

EDIT: I'm going to go with Boost for now. Additional suggestions are still welcome.
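
For illustration, here is a minimal sketch of the kind of structure described in this post, using only the C++17 standard library. It is not Boost's API, and every name in it (DirectedGraph, addNode, setEdge, weight, edgeCount) is hypothetical. Instead of one edge object holding a pair of weights, this sketch stores each direction as its own entry, which is a swapped-in design choice that still allows at most two edges per pair of connected nodes.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical sketch: a directed, weighted graph with at most one edge
// per direction between any two nodes (so at most two edges per pair).
template <typename Weight>
class DirectedGraph {
public:
    using NodeId = std::size_t;

    // Adds a node and returns its id (ids are assigned sequentially from 0).
    NodeId addNode() {
        adjacency_.emplace_back();
        return adjacency_.size() - 1;
    }

    // Adds the edge from -> to, or overwrites its weight if it already exists.
    // Returns false if either node id is unknown.
    bool setEdge(NodeId from, NodeId to, Weight weight) {
        if (from >= adjacency_.size() || to >= adjacency_.size()) {
            return false;
        }
        adjacency_[from].insert_or_assign(to, std::move(weight));
        return true;
    }

    // Returns the weight of the edge from -> to, or an empty optional
    // if there is no edge in that direction.
    std::optional<Weight> weight(NodeId from, NodeId to) const {
        if (from >= adjacency_.size()) {
            return std::nullopt;
        }
        const auto it = adjacency_[from].find(to);
        if (it == adjacency_[from].end()) {
            return std::nullopt;
        }
        return it->second;
    }

    // Total number of directed edges in the graph.
    std::size_t edgeCount() const {
        std::size_t count = 0;
        for (const auto& outgoing : adjacency_) {
            count += outgoing.size();
        }
        return count;
    }

private:
    // adjacency_[n] maps each target node to the weight of the edge n -> target.
    std::vector<std::map<NodeId, Weight>> adjacency_;
};
```

With two nodes a and b, calling setEdge(a, b, 1.5) and setEdge(b, a, 4.0) gives the pair one edge in each direction with its own weight, and weight(a, b) and weight(b, a) can then be queried independently. Keying each node's map by target is what enforces the one-edge-per-direction limit.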

Friday, February 9, 2007

Sonnet Updates

I've been fairly busy the past two weeks and haven't put as much work into Sonnet as I would like. However, there are several recent developments to mention.

Language Detection
We now have preliminary support for distinguishing between pt_PT and pt_BR as well as en_US and en_GB. Portuguese seems to be a special case that most NLP programs explicitly acknowledge, which I now understand. I'm not sure what should be done to additionally distinguish between en_ZA and en_AU. I have a few ideas and will let everyone know my thoughts after more testing is done. I really didn't want to start messing around with dialects, but the response to that position has been massively against me, so into the fray I go.

Elixir
The engine and documentation for Elixir are now ready for public scrutiny and comment. It has been interesting for me to write, since I've decided to use only C++ and the standard libraries so that there wouldn't be any dependencies. Qt has really spoiled me :) The work has so far been done in my personal Subversion repository. It will be made public as soon as Bug #9775 has been fixed and I have a working freedesktop.org CVS account.

Documentation
Aseigo mentioned the need for documentation today:

"sonnet might be cool, for instance, but unless there's a tutorial that lets people start using it in their application code quickly it'll almost certainly end up under-utilized and/or take many more revision releases of kde4 to find its potential realized."
So true. The public interfaces for Sonnet have only just settled down, and some are still on my computer and yet to be committed. So, this weekend I'll commit most of the changes littering my working directory and start outlining some tutorials to be put in the wiki. I've been fairly good about providing apidox so far, but those need some improvements as well. Of course, KDE4PORTING.html needs to be updated too.

Merging
I've been hesitant to merge into trunk while the interfaces were rapidly changing, but now that isn't much of a concern. A list of programs and libraries in kdepimlibs, kdebase and several other specific cases has been compiled, and I'll be able to modify them when merging to ensure they build. For those projects that I personally won't migrate, the tutorial should enable their developers to migrate seamlessly.

Wednesday, February 7, 2007

Sonnet In The Press

Nathan Sanders of OSTG interviewed me for an article last week. The result has just been published on Linux.com in an article entitled, "KDE 4's Sonnet will turbocharge language processing."

Overall, I'm pleased with the coverage, but I do have a few misgivings, although any minor errors are likely my fault for providing limited explanations in the interview. The scope of my concern is largely limited to grandiose statements I did not intend. For example, "[I]mproved multilingual support is the "most requested change" from KDE 3..." I didn't really mean this; it was a context-sensitive statement. I meant something such as, "Excluding technical issues like 'KSpell doesn't work for me,' the most requested features for KSpell that I know of (from end users) involve improving its multilingual support." But requesting extra qualifiers for my statements is more likely an exercise in vanity than in promoting greater truth.

I will mention that the article doesn't note the work of David Sweet. I'm not familiar with exactly what he did, but my understanding is that he wrote much of the original KSpell and thus deserves some credit.

Tuesday, January 30, 2007

Instant Messaging

A discussion came up on the kde-devel mailing list regarding the term "instant messaging." I am no expert in English grammar, but I do have some knowledge of what might be going on. The term has taken hold in the popular lexicon. The American Heritage Dictionary defines instant messaging as:

n. The transmission of an electronic message over a computer network using software that immediately displays the message in a window on the screen of the recipient.
In English, if a given lexeme has a verb form, it usually also has a gerund, which is often identical to the progressive (often called participle) form of the verb. There are many cases where the infinitive also has a noun that describes the action of performing the verb, unlike the gerund, which describes the act itself. Gerunds have special rules in English and can be clefted, unlike the verbal forms they often act as substitutes for.

For example:
verb: to run (infinitive)
noun: run = the act of running
verb: running (past progressive) (i.e. she was running) = an inflected form of the verb
gerund: running (noun) = the action of the verb to run

Of course, there are many lexemes in English where the past progressive can also be used as an adjective, further confusing the matter.

So, "instant messaging" can be parsed several ways. For each of the several possible forms "messaging" could take, there also exists a complementary form for "instant", but I can take a stab at the intended meaning via the following examples:
  • Ian enjoys instant messaging his friends. ("messaging" is a gerund)
  • Ian is instant messaging his friends. ("messaging" is a participle verb)
Note that in the first sentence, the clause "instant messaging his friends" is actually acting as a single noun.

There seems to be no problem using "instant messaging" as a noun in English.

Thursday, January 25, 2007

Language Detection Works!

I've finally been able to put some of Sonnet's many pieces together. Initial integration of language detection into the spell-check highlighter class has just been committed.



The screenshot shows text that I copied from the Hebrew, German and English homepages of Wikipedia in Konqueror. Upon pasting them into the simple test application, Sonnet detected the languages in a background thread and then proceeded to spell-check the paragraphs, also in a background thread.

Sunday, January 21, 2007

to boldly justify my prescriptivism

The following is a retort to an email attacking Sonnet and spell checking in general. I had initially written a vindictive reply, which I decided not to send and instead rewrote for a more general audience without, I hope, the angry overtones.

Languages are living, changing, amorphous things. Whilst it is possible to categorize them for certain uses, such classification seems to fail in other contexts. If we were to honestly demarcate a category for every language on Earth, there would be one for every human being on the planet. We all have our own lexicons, grammars and even orthographies. Of course, a personal language would be near useless if there were no group to comprehend it. Therefore, we often classify languages into groupings of mutual intelligibility and hierarchies by degree thereof.

Spelling and grammar rules exist to convey our information in a manner that can be understood by others who also understand these rules. Style guides (should) provide hints on presenting information in a less ambiguous way. There is great tension among scholarly and armchair linguists alike in characterizing certain aspects of language as correct. I've certainly felt this tension internally while working on Sonnet. Yet the need to communicate dictates a priori the necessity of common protocols and widely used conventions.

So, to the angry woman (who thankfully was not a contributor to KDE, although she was a bitter user) who deemed it worth her time to write and send a tirade outlining the alleged hypocrisy of descriptive linguists creating prescriptive software, I suggest that she consider my true purpose.

The need to communicate clearly outweighs the disadvantage of using a language in a “non-standard” manner. But how are “non-standard” uses defined? Should we follow decree, common convention or the prescription of articulate, but deadline-driven and uninformed, writers? I certainly won't (and don't) posit myself as the arbiter of linguistic rule, although I have my opinions on English usage. I simply wish to improve upon the existing technology that enables better communication. Providing optional language aids is but one part of this desire. Furthermore, seeking to empower users of minority tongues does not force them to abandon their unique and valuable linguistic traditions. Merely providing users with tools allows those who wish to ensure their writing is understood by others to verify that it is.

Another Reason ODF Rocks

Bill Poser writes at the Language Log:

Now, you might be wondering what this all has to do with linguistics. Well, one of the things that document metadata specify is the language of the document. The Open Document standard does this correctly. It uses (p. 61) the three-letter language codes of ISO-639, followed by a two-letter country code following ISO 3166. This allows for the specification of any of the world's languages. A three letter code allows for as many as 17,576 languages. ISO-639-3 in fact already encodes most of the world's approximately 6,700 languages. Open XML, on the other hand, does not follow ISO-639-3. Instead (section 2.18.52), it requires that languages be specified by means of two hexadecimal digits, e.g. 0x09 for English. That means that no more than 256 languages can be accomodated. The list of languages available is in the document referenced above on pp. 2531-2537 but for the two-letter hex codes you'll have to look elsewhere because Microsoft doesn't list them together with the languages. For some reason it gives a completely different set of non-hexadecimal codes ranging from 1025 to 58,380. The hex codes can be found in the fourth column of this table, the one labelled "Win Code".
Three cheers for standards and simplicity. I'd love to use the same language tagging standard as ODF in KDE, but there are a few current limitations, depending on context. Most spell checkers require 639-1 codes for the language part. Sonnet uses these as well for languages that have them, and 639-3 codes for those that don't.
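
To make that fallback rule concrete, here is a minimal C++17 sketch. It is not Sonnet's actual code: the names (LanguageCodes, kLanguages, preferredLanguageCode, makeLocaleTag) and the tiny hand-written table are assumptions for illustration only. The rule it shows is simply "use the two-letter 639-1 code when one exists, otherwise fall back to the three-letter 639-3 code", optionally followed by an ISO 3166 country code.

```cpp
#include <array>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical record pairing a language's ISO 639-3 code with its
// ISO 639-1 code, when one has been assigned.
struct LanguageCodes {
    std::string_view iso639_3;  // three-letter code, always present
    std::string_view iso639_1;  // two-letter code, empty when none exists
};

// A tiny illustrative table; a real one would cover far more languages.
constexpr std::array<LanguageCodes, 5> kLanguages{{
    {"eng", "en"},  // English
    {"por", "pt"},  // Portuguese
    {"nob", "nb"},  // Norwegian Bokmål
    {"nno", "nn"},  // Norwegian Nynorsk
    {"haw", ""},    // Hawaiian has no ISO 639-1 code
}};

// Returns the 639-1 code if the language has one, else its 639-3 code.
// Returns an empty optional for languages missing from the table.
std::optional<std::string> preferredLanguageCode(std::string_view iso639_3) {
    for (const auto& entry : kLanguages) {
        if (entry.iso639_3 == iso639_3) {
            return std::string(entry.iso639_1.empty() ? entry.iso639_3
                                                      : entry.iso639_1);
        }
    }
    return std::nullopt;
}

// Builds a tag such as "pt_BR" from a language and an optional
// ISO 3166 country code; the country part is omitted when empty.
std::optional<std::string> makeLocaleTag(std::string_view iso639_3,
                                         std::string_view country) {
    auto language = preferredLanguageCode(iso639_3);
    if (!language) {
        return std::nullopt;
    }
    if (country.empty()) {
        return language;
    }
    return *language + "_" + std::string(country);
}
```

Under these assumptions, makeLocaleTag("por", "BR") yields the familiar "pt_BR", while a language without a two-letter code, such as Hawaiian, keeps its three-letter code.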

Treating macrolanguages and separate dialects with distinction (when relevant) has caused end users quite some concern. Most of the problems I've been alerted to are being addressed. I've found the worldwide KDE community to be quite helpful and informative. Time that could have been spent programming has instead been consumed researching the differences between the Norwegian [nor] languages, Bokmål [nob] and Nynorsk [nno], or tracking down some obscure variation of Cornish, existing in only one town, which I won't yet mention since I might publish something about it in the upcoming year.