WordResolutionAlgorithm

See the BPFK's Morphology Forum

I felt I could not go one more day without creating a wiki page for this issue.

While Lojban has a very precise, strict, easy-to-parse, and thus unambiguous grammar, it has not (and I emphasize this) been proved syntactically unambiguous.

Let me rephrase this statement.

If we consider the Lojban language in a purely formal sense, as an alphabet composed of its words, supposedly classified into lexemes, then any finite string of Lojban (a stream of words) can be proven either grammatically correct or not in a finite amount of time using a Turing machine.

This is what the Lojban grammar is for: it defines axioms and inference rules between its words (more precisely, its lexemes) in a form that can be fed into an LALR(1) parser. Because the LALR(1) parsing algorithm can be implemented on a Turing machine, and because it can tell, within a finite amount of time, whether a (finite) string of lexemes follows its grammar or not, we have the result stated above.
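To make the claim concrete, here is a toy sketch in Python: a recognizer over an already-classified lexeme stream always halts with a yes/no answer. The three-symbol grammar fragment (sentence := SUMTI SELBRI [SUMTI]) is entirely made up for illustration and is not the real Lojban grammar.

```python
# Toy illustration: once text is split into classified lexemes,
# grammaticality is decidable. This is NOT the real Lojban grammar --
# just a hypothetical one-rule fragment to show the principle.

def recognize(lexemes):
    """Return True iff the lexeme stream matches the toy grammar
    sentence := SUMTI SELBRI [SUMTI]."""
    i = 0
    def eat(kind):
        nonlocal i
        if i < len(lexemes) and lexemes[i] == kind:
            i += 1
            return True
        return False
    if not (eat("SUMTI") and eat("SELBRI")):
        return False
    eat("SUMTI")                # optional trailing sumti
    return i == len(lexemes)    # every lexeme must be consumed

print(recognize(["SUMTI", "SELBRI", "SUMTI"]))  # True
print(recognize(["SELBRI"]))                    # False
```

The point is only that the check terminates on every finite input; the real parser does the same thing with hundreds of rules instead of one.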

Now, what does that mean?

It means that given any stream of Lojban text, if we can distinguish its words and categorize them into lexemes, then (roughly speaking) we can parse the text with a computer. However, this presupposes that we already know a way to distinguish and categorize those words.

And so ?

Two facts:

  • To this day, the way to distinguish the words in a Lojban stream of text is very, very poorly documented (from a computer programmer's point of view). The only rules, given in the refgram, are vague and were never proven unambiguous.

  • The most precise Lojban lexer to this day, the morphology algorithm of jbofi'e, has been proved neither complete nor unambiguous, while already being hugely complex. And the more complex a system is, the less credible it is to the human mind.

Epilogue


As far as I understand the morphology of Lojban, a complete lexer simply cannot be written as an LR(k) or LL(k) grammar, for various reasons related to backtracking and to the relation between semantics and morphology.

What does any of this have to do with semantics? brkwords.txt will correctly break up djan.polake'ebrIvlaboinoigIsmumu just as well as something grammatical. — However, we still have a hard time with "su" and "sa", and even with "si" when it is used to erase a morphologically invalid piece of text. Not to mention that the morphology might be ambiguous in itself, yet unambiguous when considered together with the grammar (an apparent ambiguity may never arise because the competing lexemes are not allowed by the grammar in that position).
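As a concrete illustration of the backtracking problem, here is a minimal Python sketch of word splitting against a tiny made-up lexicon. The words, including the fake entry "mik" added specifically to derail a greedy longest-match strategy, are illustrative only; real Lojban morphology relies on word-shape and stress rules rather than a dictionary lookup.

```python
# Sketch of why a purely left-to-right, no-backtracking lexer struggles:
# splitting a pause-free string into known word forms can require undoing
# earlier choices. Hypothetical mini-lexicon; "mik" is a fake word added
# to force the wrong greedy choice.

LEXICON = {"mi", "klama", "le", "zarci", "mik"}

def split(text, words=LEXICON):
    """Return one segmentation of `text` into lexicon words, or None.
    Tries longer prefixes first and backtracks on failure."""
    if text == "":
        return []
    for cut in range(len(text), 0, -1):      # longest-prefix-first
        head, tail = text[:cut], text[cut:]
        if head in words:
            rest = split(tail, words)
            if rest is not None:
                return [head] + rest
    return None

# Greedy longest-match alone would pick "mik" and get stuck;
# backtracking recovers "mi klama le zarci".
print(split("miklamalezarci"))  # ['mi', 'klama', 'le', 'zarci']
```

An LL(k)/LR(k) machine has only bounded lookahead, so this kind of arbitrarily deep retreat is exactly what it cannot express.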

The attempt in jbofi'e to solve the problem using NFAs is interesting, but it leads to a huge graph of automata, which is very difficult to prove correct.
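For readers unfamiliar with the technique, here is a minimal set-of-states NFA simulation in Python, the building block that a jbofi'e-style lexer composes at a much larger scale. The word shape it accepts (one consonant followed by one or more vowels) is a toy stand-in, not an actual Lojban cmavo rule.

```python
# Minimal set-of-states NFA simulation. The machine accepts the
# hypothetical shape "C V+" (one consonant, then one or more vowels) --
# a toy stand-in for a real word-shape automaton.

CONS = set("bcdfgjklmnprstvxz")
VOWS = set("aeiou")

# states: 0 = start, 1 = seen consonant, 2 = seen >= 1 vowel (accepting)
TRANSITIONS = {
    (0, "C"): {1},
    (1, "V"): {2},
    (2, "V"): {2},
}
ACCEPT = {2}

def classify(ch):
    if ch in CONS:
        return "C"
    if ch in VOWS:
        return "V"
    return None

def accepts(word):
    states = {0}
    for ch in word:
        kind = classify(ch)
        states = set().union(*(TRANSITIONS.get((s, kind), set())
                               for s in states))
        if not states:          # dead end: no live state remains
            return False
    return bool(states & ACCEPT)

print(accepts("mi"))     # True
print(accepts("klama"))  # False (two leading consonants)
```

Each such automaton is trivial to verify in isolation; the difficulty the text describes comes from wiring dozens of them together and proving the combined graph right.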

While I continue to search for an elegant algorithmic solution to this issue, I nonetheless believe that real parsing of Lojban by computer (parsing that involves a real lexer) is a very difficult task, unlikely to be done right any time soon.

--Elrond



One should see the file brkwords.txt for the official LLG morphology algorithm.

  • Definitely not official!
    • It is what the official LLG parser uses, isn't it?
      • No. As noted elsewhere, the official parser can only segment compound cmavo, and otherwise relies on spaces/dots (which it considers the same).


It seems to work quite well, but it could probably be polished up a little on type 4 fu'ivla.



At one point I put some thought into trying to write a context-free grammar describing the morphology. (Not a deterministic one, however; I was intending to use the grammar in conjunction with a generalized parser, like Tomita's or Earley's.) Didn't get anywhere. I think that if an NFA can parse it, there should definitely be a CFG which can describe it, however. (CFGs being more powerful language-recognition mechanisms than finite automata.) --jay
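For reference, a compact Earley recognizer fits in a few dozen lines and accepts any CFG, ambiguous or not, in O(n^3) worst-case time. The grammar below is a throwaway (CV)+ word-shape toy, not a proposal for Lojban morphology.

```python
# Compact Earley recognizer, the kind of "generalized parser" mentioned
# above. The toy grammar recognizes (CV)+ over the terminals "C" and "V".

GRAMMAR = {
    "Word": [["C", "V"], ["C", "V", "Word"]],   # toy: (CV)+
}

def earley(tokens, grammar, start="Word"):
    # An item is (lhs, rhs, dot, origin).
    chart = [set() for _ in range(len(tokens) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, tuple(rhs), 0, 0))
    for i in range(len(tokens) + 1):
        added = True
        while added:                             # run to fixpoint
            added = False
            for lhs, rhs, dot, origin in list(chart[i]):
                if dot < len(rhs):
                    sym = rhs[dot]
                    if sym in grammar:           # predict
                        for prod in grammar[sym]:
                            new = (sym, tuple(prod), 0, i)
                            if new not in chart[i]:
                                chart[i].add(new); added = True
                    elif i < len(tokens) and tokens[i] == sym:  # scan
                        chart[i + 1].add((lhs, rhs, dot + 1, origin))
                else:                            # complete
                    for l2, r2, d2, o2 in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            new = (l2, r2, d2 + 1, o2)
                            if new not in chart[i]:
                                chart[i].add(new); added = True
    return any((start, tuple(rhs), len(rhs), 0) in chart[len(tokens)]
               for rhs in grammar[start])

print(earley(list("CVCV"), GRAMMAR))  # True: "CVCV" matches (CV)+
```

Because Earley handles nondeterminism natively, the morphology grammar would not need to be massaged into LALR(1) form first; ambiguity would surface as multiple completed parses rather than as a construction-time conflict.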

Btw, parsing, even with an NFA, is Really Really fast. So even if the process is ugly, as long as it works unambiguously, you're all set. -j



Okay. I have read the brkwords.txt file thoroughly. The algorithm described there can be coded in a "classic" programming language, which means (by the Church–Turing thesis) that a Turing machine can morphologically parse any Lojban text using this algorithm.

However, two problems still remain:

  • First, we are still not sure that brkwords.txt describes the reverse of the word-formation rules given in the refgram; in other words, it is not certain that it is exactly the inverse of the WordFormationAlgorithm. If it is, all is well: we have an algorithm which, given a finite word, classifies it into a unique lexeme. If it is not, that alone proves nothing, because it has not been shown impossible to invert the formation algorithm some other way.

  • Second, even in the case where brkwords.txt does invert the formation algorithm (which I doubt, but...), it is very unsatisfying. In particular, running this algorithm requires parsing the text in both directions from any "important" point. More practically, it requires the reader (be it computer or human) to memorize the whole of the text (when it is not cmavo-prefixed) up to the first pause, which may occur arbitrarily late, before starting to parse. Although this is acceptable for computers, I find this constraint very cumbersome as a human listener.
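The buffering constraint described in the second point can be sketched as follows: a brkwords-style lexer cannot start until it holds a complete pause-delimited chunk, so a streaming front end must accumulate characters until a pause arrives (written "." here, as in Lojban orthography).

```python
# Sketch of the buffering constraint: nothing can be handed to a
# bidirectional lexer before the pause that closes the chunk arrives.

def chunks_by_pause(stream):
    """Yield pause-delimited chunks from a character stream.
    The pause character "." is the only safe cut point."""
    buf = []
    for ch in stream:
        if ch == ".":
            if buf:
                yield "".join(buf)
                buf = []
        else:
            buf.append(ch)
    if buf:                     # trailing chunk with no final pause
        yield "".join(buf)

print(list(chunks_by_pause("miklamalezarci.doklama.")))
# ['miklamalezarci', 'doklama']
```

A human listener would have to do the equivalent of this buffering in working memory, which is exactly the complaint above.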


At any rate, I would far prefer to invert the word formation algorithm.

  • All very well, except that there is no WFA for type 4 fu'ivla. — Well, we have a general rule for recognizing brivla, and morphology rules for every other type of brivla. Therefore a parser that can recognize a brivla will recognize type 4 fu'ivla by its "default" brivla rule.


It does not mean that a listener has to remember the string all the way up until the first pause. Humans are capable of performing lexical, syntactic and semantic processing of language simultaneously. As the lexical part of your mind tries to split up words, the syntactic part is looking to see if the series of words is valid, while the semantic part is looking to see if it makes sense. You'll discard parses which aren't syntactically or semantically valid. Further, there are parsing algorithms which won't be tripped up by bidirectionality of processing, so a computer isn't going to have a very big problem with it, just as long as it isn't ambiguous. --jay


Created by JayKominek. Last Modification: Monday 22 of September, 2003 22:14:24 GMT by JayKominek.