Tuesday, January 30, 2007

Instant Messaging

There was a discussion on the kde-devel mailing list regarding the term "instant messaging." I am no expert at English grammar, but I do have some knowledge of what might be going on. The term has taken hold in the popular lexicon. The American Heritage Dictionary defines instant messaging as:

n. The transmission of an electronic message over a computer network using software that immediately displays the message in a window on the screen of the recipient.
In English, if a given lexeme has a verb form, it usually also has a gerund, which is often equivalent to the progressive (often called participle) form of the verb. There are many cases where the infinitive also has a noun that describes the action of performing the verb, unlike the gerund, which describes the act itself. Gerunds have special rules in English and can be clefted, unlike the verbal forms they often act as substitutes for.

For example:
verb: to run (infinitive)
noun: run = the act of running
verb: running (past progressive) (i.e. she was running) = an inflected form of the verb
gerund: running (noun) = the action of the verb to run

Of course, there are many lexemes in English where the past progressive can also be used as an adjective, further confusing the matter.

So, "instant messaging" can be parsed several ways. For each of the several forms "messaging" could take, there also exists a complementary form for "instant," but I can take a stab at the intended meaning via the following examples:
  • Ian enjoys instant messaging his friends. ("messaging" is a gerund)
  • Ian is instant messaging his friends. ("messaging" is a participle verb)
Note that in the first sentence, the clause "instant messaging his friends" is actually acting as a single noun.

There seems to be no problem using "instant messaging" as a noun in English.

Thursday, January 25, 2007

Language Detection Works!

I've finally been able to put some of Sonnet's many pieces together. Initial integration of language detection into the spell-check highlighter class has just been committed.



The screenshot shows text that I copied from the Hebrew, German and English homepages of Wikipedia in Konqueror. When I pasted them into the simple test application, Sonnet detected the languages in a background thread and then proceeded to spell-check the paragraphs, also in a background thread.

Sunday, January 21, 2007

to boldly justify my prescriptivism

The following is a retort to an email attacking Sonnet and spell checking in general. I had initially written a vindictive reply, which I decided not to send and instead rewrote for a more general audience without, I hope, the angry overtones.

Languages are living, changing, amorphous things. Whilst it is possible to categorize them for certain uses, such classification seems to fail in other contexts. If we were to honestly demarcate a category for every language on Earth, there would be one for every human being on the planet. We all have our lexicons, grammars and even orthography. Of course, a personal language would be near useless if there was not a group to comprehend it. Therefore, we often classify languages into groupings of mutual intelligibility and hierarchies by degree thereof.

Spelling and grammar rules exist to convey our information in a manner that can be understood by others who also understand these rules. Style guides (should) provide hints on presenting information in a less ambiguous way. There is great tension among scholarly and armchair linguists alike in characterizing certain aspects of language as correct. I've certainly felt this tension internally while working on Sonnet. Yet the need to communicate dictates a priori the necessity of common protocols and widely used conventions.

So, to the angry woman (who thankfully was not a contributor to KDE, although she was a bitter user) who deemed it worth her time to write and send a tirade outlining the alleged hypocrisy of descriptive linguists creating prescriptive software, I suggest that she consider my true purpose.

The need to communicate clearly outweighs the disadvantage of using a language in a “non-standard” manner. But how are “non-standard” uses defined? Should we follow decree, common convention or the prescription of articulate, but deadline-driven and uninformed writers? I certainly won't (and don't) posit myself as the arbiter of linguistic rule, although I have my opinions on English usage. I simply wish to improve upon the existing technology that enables better communication. Providing optional language aids is but one part of this desire. Furthermore, seeking to empower users of minority tongues does not force them to abandon their unique and valuable linguistic traditions. Merely providing users with tools allows those who wish to ensure their writing is understood by others to verify that it is.

Another Reason ODF Rocks

Bill Poser writes at the Language Log:

Now, you might be wondering what this all has to do with linguistics. Well, one of the things that document metadata specify is the language of the document. The Open Document standard does this correctly. It uses (p. 61) the three-letter language codes of ISO-639, followed by a two-letter country code following ISO 3166. This allows for the specification of any of the world's languages. A three letter code allows for as many as 17,576 languages. ISO-639-3 in fact already encodes most of the world's approximately 6,700 languages. Open XML, on the other hand, does not follow ISO-639-3. Instead (section 2.18.52), it requires that languages be specified by means of two hexadecimal digits, e.g. 0x09 for English. That means that no more than 256 languages can be accomodated. The list of languages available is in the document referenced above on pp. 2531-2537 but for the two-letter hex codes you'll have to look elsewhere because Microsoft doesn't list them together with the languages. For some reason it gives a completely different set of non-hexadecimal codes ranging from 1025 to 58,380. The hex codes can be found in the fourth column of this table, the one labelled "Win Code".
Three cheers for standards and simplicity. I'd love to use the same language tagging standard as ODF in KDE, but there are a few current limitations, depending on context. Most spell checkers require 639-1 codes for the language part. Sonnet uses these as well for languages that have them, and 639-3 codes for those that don't.
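
The tagging rule just described (a 639-1 code where one exists, a 639-3 code otherwise) can be sketched in a few lines of Python. This is only an illustration, not Sonnet's code: the function name language_tag and the tiny lookup table are assumptions made up for the example, and the table covers only a handful of languages.

```python
# Hypothetical sketch: prefer the two-letter ISO 639-1 code when one exists,
# otherwise keep the three-letter ISO 639-3 code.
# The table is a small illustrative sample, not a complete mapping.
ISO_639_3_TO_1 = {
    "eng": "en",
    "fra": "fr",
    "deu": "de",
    "nob": "nb",
    "nno": "nn",
}


def language_tag(iso_639_3, country=None):
    """Return a tag such as 'en_GB'; languages without a 639-1 code keep
    their 639-3 code (for example 'ceb' or 'haw')."""
    lang = ISO_639_3_TO_1.get(iso_639_3, iso_639_3)
    return f"{lang}_{country}" if country else lang


# The arithmetic from the quote: three letters versus two hex digits.
THREE_LETTER_CODES = 26 ** 3   # 17576 possible codes
TWO_HEX_DIGIT_CODES = 16 ** 2  # 256 possible codes
```

Here language_tag("eng", "GB") gives "en_GB", while language_tag("ceb") stays "ceb" because Cebuano has no two-letter code.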

Treating macrolanguages and separate dialects with distinction (when relevant) has caused end users quite some concern. Most of the problems I've been alerted to are being addressed. I've found the worldwide community of KDE to be quite helpful and informative. Time that could have been spent programming has instead been consumed researching the differences between the Norwegian [nor] languages, Bokmål [nob] and Nynorsk [nno], or tracking down some obscure variation of Cornish, only existing in one town, which I won't yet mention since I might publish something about it in the upcoming year.

Wednesday, January 17, 2007

Queen and Country

Can Sonnet detect the difference between en_US and en_GB? No, and yes. I've been asked that quite a bit, so I'll clarify. There are several key requirements, as I see them, for adding language detection to KDE.

  1. It must distinguish between different languages. While all supported languages could be detected, it is more likely that a user will only use a few languages in most of his sessions, or that the program will be used in a setting by people speaking a specific range of languages. For example, a speaker of French and German, or a school computer lab with speakers of Xhosa, Zulu and Afrikaans.
  2. It must be fast. The detection must occur in real time; otherwise, the user might as well select their language manually.
I've optimized the language detection for the above criteria. The models used are limited so that they are fast. This makes detection less accurate, but this is overcome by taking into account other factors such as default settings or the language detected for surrounding text segments.
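
To make the trade-off concrete, here is a minimal Python sketch of detection with small models, a restricted candidate list and a fallback to a default. It is an illustration under stated assumptions, not Sonnet's implementation: the use of character trigrams, the function names, the model size and the min_score threshold are all choices made for this example.

```python
from collections import Counter


def trigrams(text):
    """Count overlapping character trigrams in lower-cased text."""
    text = " " + " ".join(text.lower().split()) + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def build_model(sample, size=300):
    """Keep only the most frequent trigrams so the model stays small."""
    return {gram for gram, _ in trigrams(sample).most_common(size)}


def guess_language(text, models, candidates, default, min_score=0.2):
    """Score only the candidate languages. Fall back to the default
    (e.g. the user's setting or the surrounding text's language)
    when no candidate matches enough of the text's trigrams."""
    grams = trigrams(text)
    total = sum(grams.values())
    if total == 0:
        return default
    best, best_score = default, 0.0
    for lang in candidates:
        hits = sum(n for gram, n in grams.items() if gram in models[lang])
        score = hits / total
        if score > best_score:
            best, best_score = lang, score
    return best if best_score >= min_score else default
```

Restricting candidates to the few languages a user actually writes keeps the loop short, and the default absorbs the cases where the small models are not confident.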

Are users of multiple dialects less important? No, but if I value priorities (1) and (2) above, then the detection of the dialect will be highly inaccurate due to the nature of the algorithms used. However, detecting a sample of text written in a dialect as the common language is currently robust. It must be; the detection would be useless for our purposes if it were so brittle as to fail on minor spelling differences, which are likely candidates for errors. Additionally, the statistical differences between dialects are much smaller than those between different languages. In most cases, text in a different dialect just isn't contrastive enough in the greater scheme of things.

Now, there are ways to distinguish between American and British spellings. Doing it at the statistical comparison stage isn't reliable with the current heuristics, but it could be done if there were enough user demand. However, any application that utilizes the language guessing class should be a bit smarter. The convenience classes Sonnet will provide will distinguish between countries by checking user settings. If the user's locale is de_BE and French is detected, then the spell checker would default to checking with fr_BE; it would fall back to fr or fr_FR if a dictionary for fr_BE was not found.
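
The fallback order in the fr_BE example can be sketched as follows. This is a hedged Python illustration rather than the convenience classes themselves: choose_dictionary is a hypothetical name, and deriving the last candidate by upper-casing the language code (fr becomes fr_FR) is a naive assumption that happens to fit this example but would not hold for every language.

```python
def choose_dictionary(detected_lang, user_locale, available):
    """Pick a dictionary for the detected language, preferring the
    country part of the user's locale (e.g. 'de_BE' gives country 'BE')."""
    _, _, country = user_locale.partition("_")
    candidates = []
    if country:
        candidates.append(f"{detected_lang}_{country}")  # e.g. fr_BE
    candidates.append(detected_lang)                     # e.g. fr
    # Naive guess at a "home country" dictionary, e.g. fr_FR (assumption).
    candidates.append(f"{detected_lang}_{detected_lang.upper()}")
    for name in candidates:
        if name in available:
            return name
    return None  # no suitable dictionary installed
```

With a de_BE locale and French detected, the candidates are tried in the order fr_BE, fr, fr_FR, and None signals that nothing suitable is installed.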

In summary, the language detection class will not distinguish between cases where the same language has a different orthography in a different country. Yet, tools built using this class will distinguish between them using alternative means. But, I could be wrong in my assumptions. If enough complaints surface, it's early enough to change the behavior.

For all those who emailed me questions similar to ris:
[08:28] <ris> rideout: re: sonnet test sentences - would it be useful to put a british english test sentence in? would sonnet be able to distinguish between the two if you stuck a few 'realise', 'centre' and 'programme's in?
It is very reasonable to test that British English is detected as English. Empirically, Sonnet is not encumbered by the mild differences that exist between British and American orthography.

EDIT: I meant to ask this originally, but became carried away and forgot. Are there languages for which I must distinguish between dialects? I can think of examples like Chinese where the dialects are essentially different languages. But, in this case they all share a common orthography. Are there those with different orthographies?

NOTE: In this post, I did conflate the notion of dialects and countries with different standards of orthography. To the pedantic, please don't shoot. To the curious, I try to limit the use of technical linguistic terms in this blog and to use the imprecise and often confused vernacular; this has led to some confusion in the past.

New Sonnet Mailing List

Sonnet now has a mailing list. When possible, please post both questions and suggestions for Sonnet to this list rather than on this blog or in a personal email. Everyone interested in how Sonnet will end up should subscribe.

Thanks for all the great help so far.

Tuesday, January 16, 2007

Can your language be detected?

Today, I've added a few more languages to Sonnet::GuessLanguage. I've also improved the speed considerably, removing a linear search. Below are the ISO 639 codes that currently should be detected.

af, ar, az, bg, bn, bo, ca, ceb, cs, cy, da, de, el, en, es, et, eu, fa, fi, fr, gu, ha, haw, he, hi, hr, hu, hy, id, is, it, ja, ka, kk, km, kn, ko, ky, la, lo, lt, lv, mk, ml, mn, my, nb, ne, nl, nr, nso, or, pa, pl, ps, pt, ro, ru, si, sk, sl, so, sq, sr, ss, st, sv, sw, ta, te, th, tl, tlh, tn, tr, ts, vi, uk, ur, uz, ve, xh, zh, zu

Please let me know if your language isn't supported and you would like to help.

Each of these languages should have a unit test to ensure it is actually detected. Take a look at the list of tests and see if your language is listed. If not, please send me an example sentence.

EDIT: Thanks to everyone who has helped to make corrections so far. If the test is bad (i.e. contains more proper names than suitable for a test) then please post or send a sentence that is more representative of your language.

Thursday, January 11, 2007

Wikipedia, A Source Of Error Corpora

The guys from Morfologik have created a neat method for gathering error corpora.

They've succinctly described why such data sources are needed:

Background. The developers of grammar checkers, and autocorrect lists, have hard times with finding relevant corpora. Revision history is an excellent source about native speakers perception of linguistic norms. Frequently revised typos are perceived as errors that need to be corrected, so using these typos on autocorrect lists is justified. The same goes for style, grammar and usage errors.
Well, where is the biggest source of revision history on the planet? Wikipedia. You can read the whole post on the Morfologik blog.
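
To make the idea concrete, here is a minimal Python sketch that pulls likely typo corrections out of two revisions of the same text. It is not Morfologik's actual method, which is described on their blog; the function name word_corrections and the min_similarity threshold are assumptions for this example, and only the standard library's difflib is used.

```python
import difflib


def word_corrections(old_revision, new_revision, min_similarity=0.7):
    """Return (before, after) pairs for single words that were replaced
    by a similar-looking word between two revisions."""
    old_words = old_revision.split()
    new_words = new_revision.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only consider word-for-word replacements of equal length.
        if tag != "replace" or (i2 - i1) != (j2 - j1):
            continue
        for before, after in zip(old_words[i1:i2], new_words[j1:j2]):
            ratio = difflib.SequenceMatcher(a=before, b=after).ratio()
            # Similar spellings suggest a typo fix, not a change of meaning.
            if ratio >= min_similarity:
                pairs.append((before, after))
    return pairs
```

Counting how often the same pair turns up across many revisions would then separate frequently corrected typos from one-off edits.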

Open Source Development Metrics

I find statistics and charts on the development of open source quite fascinating. In the past I've used tools such as CIA, which tracks commits to projects like KDE. It is then possible to track all manner of details.

Today I came across Ohloh. It describes itself as, "Mapping the open source world by collecting objective information on open source projects." Check out my statistics. Ohloh also tracks statistics for KDE overall.

KDE Statistics
This data is collected solely from kdelibs and kdebase in trunk. It makes an assumption about the number of "person years" it would take to write the lines of code we have, and then multiplies that by an estimated $55,000 (USD) per year, per developer, to produce a rough estimate of what it would cost to develop KDE in the absence of volunteer work. Of course, this neglects all the other non-programming work, as well as the work on KDE 1, 2 & 3.
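
As a worked illustration of that arithmetic, here is a small Python sketch. Only the $55,000 figure comes from the description above; the function name, the straight lines-per-person-year division and the input numbers are made up for the example and are not Ohloh's model or KDE's real figures.

```python
SALARY_PER_PERSON_YEAR = 55_000  # USD, the figure quoted above


def estimated_cost(lines_of_code, lines_per_person_year):
    """Rough cost: person-years needed times the yearly salary.
    A simple linear productivity rate is an assumption of this sketch."""
    person_years = lines_of_code / lines_per_person_year
    return person_years * SALARY_PER_PERSON_YEAR


# Hypothetical inputs: 1,000,000 lines at 5,000 lines per person-year
# is 200 person-years, or 11,000,000 USD.
example = estimated_cost(1_000_000, 5_000)
```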

Naturally the English Breakfast Network (EBN) provides our own statistics on both project activity and potential errors in code or documentation.

Danny Allen's Commit Digest has been getting better as well. Each week we are treated to some beautiful charts dissecting all the (programming) work that has gone into KDE.

Wednesday, January 10, 2007

Spellcheck and Usability

Lately, I've been thinking about the user interface KDE uses for spellcheck. There are two primary modes used:

  1. Dialog checking: A dialog appears, and iterates over found misspellings. Suggestion and ignore features are available in the dialog.
  2. Inline checking: Misspellings are highlighted or underlined red. Context menus provide correction options.
So I'll open the floodgates for suggestions in KDE4. What could be done to improve these interfaces in general? Are there alternate interfaces? What could be improved in KDE's GUI as it exists?

A Web Service For PWLs?

Dwayne wrote the following in the comments of a previous post:

Glad to see your rolling this into a spec. I'm so tired of OOo and Mozilla and everything else using different spell checkers and differrent PWLs. I am also very interested in spell checking and related themes. At Translate.org.za we've developed spell checkers of varying quality :) for the 11 official languages of South Africa. I'd like to know what we need to supply to get languages guessing working in Sonnet for those languages. Can you write a blog entry on what you need? One thing I would love is that users can submit their personaly dictionaries for possible inclusion into the formal dictionary. Every time someone adds a word to a personal dictionary there is a chance that it should go into the official one. We could create a poweful network of dictionary improvers.
Let me explain exactly what is happening. The spec on freedesktop.org (which has yet to be written) defines common interfaces for spell checking engines. The various spelling engines need to provide an interface conforming to the spec or, since they don't, we provide a wrapper.

The spec doesn't create a new spelling engine. However, it seems the dictionaries Translate.org.za provides are in myspell format. If installed, the Myspell plugin for Enchant would use them. It does this transparently; calling Enchant (or some spell checker that uses it) with "af_ZA" just works. Of course, Sonnet can detect Afrikaans. So in KDE4 you won't even need to set the language; the document will just start using the relevant dictionary.

Now Dwayne's second question is much more interesting. There should be some way to harness the collective power of the personal word lists of all KDE users (or even a broader category of users) and aggregate them. This is a great project for someone to pick up. I might even get the itch if I could finish all the projects I've started first.

I envision a website with some standard interface that would allow you to upload wordlists for a particular language. A client for KDE could be made that queries Sonnet and retrieves your PWLs. You could then download aggregations of words based on some criteria. For example, "Get every word with more than 5 instances in fr_FR." This could then be merged with your own list or be kept separate. The client could do this all seamlessly.
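
The aggregation query could look roughly like the Python sketch below. Nothing here exists yet: the function names, the shape of the uploads (a language tag paired with a list of words) and the choice to count each uploaded list as one instance are all assumptions made for illustration.

```python
from collections import Counter


def aggregate_wordlists(uploads, language, min_instances):
    """Count how many uploaded lists contain each word for one language
    and return the words seen more than min_instances times."""
    counts = Counter()
    for upload_language, words in uploads:
        if upload_language == language:
            counts.update(set(words))  # one vote per uploaded list
    return sorted(word for word, n in counts.items() if n > min_instances)


def merge_with_personal_list(personal_words, downloaded_words):
    """Merge a downloaded aggregation into a personal word list."""
    return sorted(set(personal_words) | set(downloaded_words))
```

The example query from the paragraph above would then be aggregate_wordlists(uploads, "fr_FR", 5), and the result could be merged or kept as a separate list.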

This would provide valuable data to those who study the addition of new words into a language. This could also be used to create dictionaries for languages which currently don't have them, provided, of course, that the language is sufficiently analytic.

EDIT: I just came across Joukahainen, a Finnish web application that is quite similar to what I describe. It's GPL, so someone might be able to use it to create my vision.

Tuesday, January 9, 2007

Freedesktop.org Spec for Language Checking

I've added a new spec to the Freedesktop.org wiki for Desktop Language Checking. It will be used to coordinate efforts between Gnome, KDE and others on spelling/grammar/style/diction checking.

Also check out this Gnome bug report: Bug 383706 – Adding support for spellcheckers into the Gtk+ stack

New Domain Name

My blog can now be read at blog.jacobrideout.net. The old site on Blogger can still be used. In fact, it is the same site. I've just added a CNAME DNS entry via Blogger's new custom domain feature.

Wednesday, January 3, 2007

Pourquoi enchanter le dragon ?

I keep getting asked why Enchant will be used in Sonnet. There seems to be a fear that the language of the questioner will no longer be supported. This is not the case. Enchant supports all the current spell checking engines in KDE - without extra layers of indirection. Enchant does the exact same thing as the old KSpell plugins. But there are some additional features that Enchant supports and KSpell does not:

  • More languages are supported. Enchant has plugins for many more languages than KDE currently.
  • Per-language engine preferences. If you write documents in multiple languages, you can choose the best checker for each language.
  • Enchant supports emulating session and personal dictionaries for checkers that don't support them.
  • Persistent settings across both Gnome and KDE.
We could add these features to what is currently available in KSpell2, but why? With Enchant we share the burden of maintenance and support.

Take a look at Enchant's website for more information.