Sunday, December 31, 2006

How Is Sonnet Stacking Up?

The Sonnet stack is quickly taking shape. We are going to have some cool capabilities in KDE4 that make writing much easier. The stack, as it is planned, is shown below. The brackets show an estimate of the work that has been completed thus far.

EDIT: Based off a discussion on kde-core-devel, all Sonnet classes will be in the Sonnet namespace. So you can prepend Sonnet:: to the class names below.

Foundations

  • QString & other Qt classes - Provides 16 bit strings that store Unicode characters.
  • UnicodeData [90%] Provides means to query information from UCD files provided by the Unicode Consortium. A tool named parseucd is provided to convert ucd files into a data format optimized for fast lookups and low memory usage. This also allows users to regenerate any relevant data files in order to modify behavior in the rest of the stack.
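
To give a feel for the kind of query UnicodeData is meant to answer, here is a minimal sketch using Python's standard unicodedata module. It is not Sonnet's API, and the helper name describe_char is made up for illustration.

```python
import unicodedata

def describe_char(ch):
    """Return (name, general category, bidi class) for one character."""
    return (
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.category(ch),       # e.g. "Lu" = uppercase letter
        unicodedata.bidirectional(ch),  # e.g. "R" = strong right-to-left
    )

# Latin A, Hebrew alef, Devanagari ka
for ch in ("A", "\u05d0", "\u0915"):
    print(describe_char(ch))
```

Properties like these are what the layers above need: the category drives word breaking, and the bidi class drives layout hints.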

Parsing (NLP)

  • GuessLanguage [50%] This class provides a statistical guess as to which language a given sample might be written in. It is based off a simple N-gram model and currently uses a trigram as well as other heuristics to determine a language. The class will be tuned to provide fast prediction for paragraph length text. [Currently based off Languid]
  • TextBreaks [90%] Provides a list of relevant breaks in a given string. The default implementation will use the suggestions provided by the Unicode Consortium. This should provide adequate partitioning for word and sentence boundaries in most of the world's languages (where such concepts have meaning in orthography).
  • AbstractFilter/DefaultFilter [85%] Provides a customizable filter for determining words and sentences. This class determines text break locations and then decides whether each segmented piece of text is relevant to the target of the query. This can be customized through interpretable rules set by the user.
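
The trigram idea behind GuessLanguage can be sketched in a few lines. This is a toy illustration and not the Languid-based code: the function names, the two tiny training strings, and the simple count-overlap score are all assumptions made for the example.

```python
from collections import Counter

def trigrams(text):
    """Count overlapping character trigrams, padding the edges with spaces."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def guess_language(sample, profiles):
    """Return the profile name whose trigram counts overlap most with sample."""
    sample_counts = trigrams(sample)

    def overlap(profile):
        # Counter returns 0 for trigrams the profile has never seen.
        return sum(min(n, profile[t]) for t, n in sample_counts.items())

    return max(profiles, key=lambda name: overlap(profiles[name]))

# Toy training text; a real model would be built from far larger corpora.
profiles = {
    "en": trigrams("the quick brown fox jumps over the lazy dog and then the cat"),
    "de": trigrams("der schnelle braune fuchs springt über den faulen hund und die katze"),
}

print(guess_language("the dog and the fox", profiles))
print(guess_language("der hund und die katze", profiles))
```

With profiles this small the guess is only reliable for text that resembles the training strings, which is why the real class adds other heuristics and is tuned for paragraph-length input.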

Correctness Testing (Spell/Grammar/Style Checking)

  • Currently KSpell2 uses a plugin framework for accessing spellchecker engines. AbiWord uses a framework they developed called Enchant, which performs almost exactly the same task as KSpell and has a very similar interface for plugins. This is no coincidence, since most spellcheckers implement an API designed for compatibility with ispell. In fact, KSpell has an Enchant plugin.
  • Sonnet will utilize Enchant as the interface to spellchecking and will no longer support the old plugins. This allows us to use the same spelling engines and rules as the growing number of applications supporting Enchant. It also makes Sonnet more maintainable and less bug-prone, with more plugins available for more languages.
  • Grammar checking and style checking are highly requested features and will be available via Elixir. Rather than write a KDE specific framework for interfacing with grammar checkers, we are working with the developers of Enchant to provide a general library similar to Enchant but tailored to the needs of these types of tools.
  • Enchant [98%]
  • Elixir [5%]
  • Spell [99%] An interface to Enchant.
  • Grammar [50%] An interface to Elixir

Background Checking

  • The parsing and analysis of language is time intensive. Sonnet will replace the old KSpell2 background checking (based on QThread subclassing) with a ThreadWeaver based implementation that will support both KSpell and KGrammar. [10%]
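
As a rough picture of queue-based background checking, here is a sketch using Python's standard thread pool in place of ThreadWeaver. The toy dictionary and both function names are invented for the example; nothing here is Sonnet or ThreadWeaver API.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy dictionary standing in for a real spelling engine (assumption).
KNOWN_WORDS = {"the", "sonnet", "stack", "is", "taking", "shape"}

def check_paragraph(paragraph):
    """Return the words in a paragraph that are missing from the toy dictionary."""
    words = (w.strip(".,!?").lower() for w in paragraph.split())
    return [w for w in words if w and w not in KNOWN_WORDS]

def check_in_background(paragraphs, max_workers=2):
    """Check each paragraph on a worker thread; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(check_paragraph, paragraphs))

print(check_in_background(["The Sonnet stack.", "It is takng shape!"]))
```

This version waits for all results before returning. An editor would more likely submit jobs as the user types and pick up results through a callback, so the GUI thread never blocks.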

GUI

No work has started on the GUI layer. Usability review requests have been made and I'm awaiting feedback. Until then, as can be seen, there is plenty of lower-level work to keep busy with.
  • Configuration - Implement features to embed configuration of Enchant and Elixir in applications.
  • Standard Checking Dialogs & Widgets - This includes the dialog that appears when checking text and allows you to iterate through errors.
  • Highlighting - Automatic highlighting of misspelled words, etc.
Beyond the usability of the GUI, some consideration is now being given to the proper behavior of the actions associated with checking a document. For example, should "ignore" permanently ignore that word in the application? Systemwide for all applications? Or just for the session in which the application is used?

Auxiliary Code

Sonnet will have a number of helpful classes and code snippets that can be incorporated into applications, including, but not limited to:
  • Automatic detection of language and setting the spellchecker to use the correct dictionary.
  • Advanced statistics - word/sentence/other counts, readability scores (Kincaid, ARI, Fog, etc.)
  • Advanced layout hints - Example: should text containing 70% Hebrew be right aligned?
  • Tools to define and configure autocorrection.
Some of these classes might not be appropriate for inclusion in kdelibs and may be placed elsewhere.
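
Two of the items above are easy to sketch. The first function computes the Automated Readability Index (ARI) from its standard formula; the second measures how much of a text is right-to-left, which is the number a layout hint would be based on. The word and sentence counting is deliberately naive and the function names are made up for this example.

```python
import re
import unicodedata

def automated_readability_index(text):
    """ARI = 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43."""
    words = re.findall(r"\w+", text)
    if not words:
        return 0.0
    chars = sum(len(w) for w in words)
    # Naive sentence count: runs of ., ! or ? (at least one sentence).
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return 4.71 * chars / len(words) + 0.5 * len(words) / sentences - 21.43

def rtl_share(text):
    """Fraction of strongly directional characters that are right-to-left."""
    classes = [unicodedata.bidirectional(c) for c in text]
    rtl = sum(1 for b in classes if b in ("R", "AL"))
    ltr = sum(1 for b in classes if b == "L")
    return rtl / (rtl + ltr) if rtl + ltr else 0.0

print(automated_readability_index("The cat sat. The dog ran."))
print(rtl_share("\u05e9\u05dc\u05d5\u05dd ok"))
```

Whether a given share, such as 70%, should flip the alignment is a policy question the code does not answer; it only supplies the measurement.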

Friday, December 29, 2006

New Conclusions

I've been collecting feedback on Phrasis over the past two weeks now. However, the holidays slowed my progress quite a bit. Today, the suggestions started to swirl around and then coalesce into something much more coherent.

My recent conclusions:

  • Writers don't want grammar checking. They want style checking and this may or may not include grammar checking.
  • Writers say they want workflow management, but they don't. When they do get it, most ignore or misuse it. I consider this analogous to programmers and writing documentation. Yet, like code documentation, managing workflow is something good writers do (though they refer to it by different names and do it in different ways). So, how do you provide useful workflow to a writer?
  • There is a demand for limited dictionaries. Rather than having every valid spelling in the English language, some writers would like a subset suitable for a less literate or less technical mass audience. Undefined words would then be highlighted and the choice of their use would be deliberate.
    By the same token, there are several well-known algorithms available to analyze the 'readability' of text. They output scores that roughly correspond to grade level.
  • Some writers would like some kind of tagging to their text. This would be similar to a more general system of annotations. You could tag a paragraph "find sources" or "needs work" and then have some system to query the tag database in the document.
I'll post more ideas later. I should have a full requirements / feature plan document for version 1.0 up around 3 January.
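
The limited-dictionary idea from the list above is simple to prototype. The sketch below is not Phrasis code: the function name, the English-only word pattern, and the tiny allowed set are assumptions for illustration.

```python
import re

def words_outside(text, allowed):
    """Return (start, end, word) spans for words not in the allowed set."""
    spans = []
    # English-only pattern; other scripts need proper word segmentation.
    for match in re.finditer(r"[A-Za-z']+", text):
        if match.group().lower() not in allowed:
            spans.append((match.start(), match.end(), match.group()))
    return spans

allowed = {"the", "cat", "sat", "on", "mat"}
print(words_outside("The cat sat on the ottoman.", allowed))
```

Returning character offsets rather than just the words lets an editor highlight each span in place, so the writer can decide deliberately whether to keep the word.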

Inkthinker on Phrasis

Kristen King, a top blogger in the writing blogosphere, made a post just before Christmas soliciting suggestions for Phrasis.

Read: Open-source text editor for writers

Thursday, December 28, 2006

Phrasis is gaining momentum

I'm now in communication with a writer from Linux.com who may be writing a story on Phrasis. This is great news since I need *much* more feedback on how Phrasis should end up. I'll post an update when I learn more.

Phrasis has two more ways to connect to users. On IRC you can now go to #phrasis, a new channel on freenode. There is a new public wiki as well.

Plus, all the work I'm doing in Sonnet for KDE will be used in Phrasis once I get the chance. This is great news for internationalization support and for platform integration.

Cheers!

The latest SVN commit feed

Several people have asked me about the svn commit log that is shown at the bottom of my blog. The code to generate it is very simple.

I use two services, feed2js and CIA. CIA is an interesting service that tracks repositories for open source projects. I take the feed they provide and use feed2js to generate a JavaScript snippet that converts the latest RSS entry to HTML.

<script language="JavaScript" src="http://feed2js.org//feed2js.php?src=http%3A%2F%2Fcia.navi.cx%2Fstats%2Fauthor%2Fjrideout%2F.rss&amp;num=1&amp;tz=-7&amp;html=a" type="text/javascript"></script>
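
For the curious, the conversion step itself (latest RSS entry to HTML) looks roughly like the sketch below. This is not feed2js code, only an illustration with the Python standard library; the function name and the sample feed are invented.

```python
import html
import xml.etree.ElementTree as ET

def latest_entry_as_html(rss_text):
    """Render the first <item> of an RSS 2.0 feed as an HTML link, or ""."""
    item = ET.fromstring(rss_text).find("./channel/item")
    if item is None:
        return ""
    # Escape both values so feed content cannot break the generated markup.
    title = html.escape(item.findtext("title", default=""))
    link = html.escape(item.findtext("link", default=""), quote=True)
    return f'<a href="{link}">{title}</a>'

sample_feed = (
    '<rss version="2.0"><channel>'
    "<item><title>Fix &amp; tidy</title><link>http://example.org/r/1</link></item>"
    "<item><title>Old</title><link>http://example.org/r/0</link></item>"
    "</channel></rss>"
)
print(latest_entry_as_html(sample_feed))
```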

Wednesday, December 27, 2006

KAutoSaveFile

Thiago Macieira made a post last month detailing an opportunity for a new KDE developer to hack on kdelibs. At the time I wanted to jump on it, but was bogged down with schoolwork. Then, this past week, after I received my Subversion access to KDE, I looked to see if anyone had taken Thiago up on the offer. No one had, so I spent a few hours and wrote the implementation of KAutoSaveFile.

With KAutoSaveFile you can easily create a temporary file to write unsaved data in. If the application fails, you can recover any lost documents. KOffice will be changing its own implementation of this feature to KAutoSaveFile shortly.
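
The general pattern looks something like the sketch below. It shows the concept only and is not the KAutoSaveFile API: the function names and the hidden ".autosave" naming scheme are assumptions for the example.

```python
import os
import tempfile

def autosave_path(document_path):
    """Hypothetical naming scheme: a hidden sibling file next to the document."""
    folder, name = os.path.split(os.path.abspath(document_path))
    return os.path.join(folder, "." + name + ".autosave")

def write_autosave(document_path, data):
    """Write to a temp file, then rename it into place over the old copy."""
    target = autosave_path(document_path)
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(target))
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(data)
    os.replace(tmp, target)

def recover_autosave(document_path):
    """Return autosaved text left by a previous run, or None if there is none."""
    target = autosave_path(document_path)
    if not os.path.exists(target):
        return None
    with open(target, encoding="utf-8") as handle:
        return handle.read()

def discard_autosave(document_path):
    """Call after a normal save or close; the recovery copy is no longer needed."""
    try:
        os.remove(autosave_path(document_path))
    except FileNotFoundError:
        pass
```

An application would call write_autosave on a timer, discard_autosave after every successful save, and recover_autosave at startup to offer any leftover copy back to the user.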

On a more general note, Aaron Seigo and I have been conversing about the overlapping file classes and methods in KDE. He has just outlined in his blog what should now be used. All this simplifying is great.

Sunday, December 24, 2006

Zach, on Sonnet

Here is Zach introducing Sonnet. It is a bit old (May 2006) yet still relevant. I plan on slowly taking over some of the maintainer responsibilities from Zach.

Talking about spell checking I played a bit with Sonnet over the weekend. I've been handling KSpell, then KSpell2 for a while and then I just grew tired of it last summer (for various reasons not really related to the code itself). I've been toying with the idea of full linguistic framework for a while. Besides spell checking we're talking about grammar checker, dictionary, thesaurus and translator. Sonnet is just that - full linguistic framework. I'd like to have all those functions available to all KDE applications. Being able to take a step back, look at all the problems I've seen and complains I got over the years from both users and developers and just sit down and rework the whole framework to fix them is great. Linguistics is fascinating and for some reasons there's not a whole lot of people who'd want to deal with it, at least not as far as its desktop usage goes.

KDE Digest

I'm in this week's KDE Digest! I've been working on Sonnet. Sonnet (also known as KSpell2) will be the spelling & grammar checker for KDE4.

I'm working on a Unicode compliant parser for word and sentence boundaries rather than the regex hack we have now. While the regex worked fine for English and most European languages, it didn't work at all for other scripts, like Hebrew or Devanagari. It now does (well mostly, we have a few bugs).
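
Here is a small illustration of why an English-centric pattern falls over and what a Unicode-aware approach gains. The regex is a stand-in, not the actual KSpell2 expression, and the second function is a crude approximation that is much simpler than the Unicode Consortium's word boundary rules. Both function names are invented.

```python
import re
import unicodedata

def ascii_regex_words(text):
    """The kind of English-centric pattern that breaks on other scripts."""
    return re.findall(r"[A-Za-z']+", text)

def unicode_words(text):
    """Group letters, combining marks and digits into words (a simplification)."""
    words, current = [], []
    for ch in text:
        # General categories starting with L, M or N: letter, mark, number.
        if unicodedata.category(ch)[0] in ("L", "M", "N"):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

# English, Hebrew and Devanagari words separated by spaces.
sample = "hello \u05e9\u05dc\u05d5\u05dd \u0939\u093f\u0928\u094d\u0926\u0940"
print(ascii_regex_words(sample))
print(unicode_words(sample))
```

Keeping combining marks attached to their base letters matters for Devanagari in particular, where vowel signs and the virama are marks rather than letters.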

Sonnet works by having plugins for various spelling and grammar engines. Once Zach (the official maintainer of Sonnet) commits the grammar interface he promised (it's somewhere on his computer, he tells me), I can convert my link-grammar engine interface to a Sonnet plugin.

Sonnet also has a few UI elements that need some usability love. This should follow in a few weeks.

KDE4 is going to rock!

New Blog

Hey, I'm going to try this blogging thing for the second time. Let's see if I can keep it up. After Christmas I'll post to introduce myself.