Lojban Speech Recognition semester-project

Posted by Anonymous on Wed 16 of Jul, 2008 21:31 GMT

Hi guys,

We have a request, and hopefully some of you are willing to help us. We are
currently studying cognitive science at the University of Osnabrueck and
participating in a course called "Practical Natural Language Processing",
which is a kind of semester project in linguistics. Our group decided to
work on speech recognition, and because Lojban has such nice phonetic
features we chose it as our target language. Unfortunately, we discovered
that there is very little (usable) Lojban audio data on the web, but we
actually need huge amounts of it to feed our training algorithms. It would
be really cool if some of you could send us some audio data we can work
with. If you do, please provide it in the following format:

- 16-bit mono, 16 kHz
- preferably raw or WAV data files
- one sentence per audio file
- a transcript text file containing one sentence per line + the name of the
audio file in which the sentence was uttered
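
For anyone who wants to check a recording against the format above before sending it, here is a minimal sketch using Python's standard `wave` module. The function name `check_wav` is only an illustration and not part of the project; it covers WAV files only, not raw data, and it does not look at the transcript.

```python
import wave


def check_wav(path):
    """Return a list of format problems; an empty list means the file
    is 16-bit mono at 16 kHz. (Hypothetical helper, WAV files only.)"""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append("expected 1 channel (mono), got %d" % wav.getnchannels())
        if wav.getsampwidth() != 2:  # sample width is in bytes: 2 bytes = 16 bits
            problems.append("expected 16-bit samples, got %d-bit" % (8 * wav.getsampwidth()))
        if wav.getframerate() != 16000:
            problems.append("expected 16000 Hz, got %d Hz" % wav.getframerate())
    return problems
```

Calling `check_wav("sentence01.wav")` (a made-up file name) returns an empty list when the file already matches the requested format.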

Everybody who sends us usable data will be mentioned by name in our
final term paper, which will be published at the end of this month (you see,
we really need the data quickly).

Thanks a lot for your effort,
Nico & Thorben

Posted by cmacis on Thu 17 of Jul, 2008 10:38 GMT posts: 85

Random sentences okay, or should they be part of a longer piece of prose? I could
churn out loads tomorrow (unless something happens), but I'm afk today to
help out at my uni. My pronunciation needs practise, but is mostly okay.
Also, wav is very big, how do you want us to send you loads of recordings in
wav?

Posted by cizra on Thu 17 of Jul, 2008 10:54 GMT posts: 9

You could rip off the audio from the parrot sketch.

Or we could organize a VoIP radio-theatre (or whatever this stuff is called).

Hmm, sounds like an idea. Get a bunch of volunteers, find a story
that's full o' dialogue (perhaps a narrator could be used), record the
whole thing in VoIP. More learning material for everyone. Joy.
Happiness.

The problem is a lack of suitable literature. La Nicte Cadzu probably
has pieces we could use, and Alice is a probable candidate as
well. Or we could try to translate a suitable work from, e.g., English.

Ideas? Volunteers?

To unsubscribe from this list, send mail to [email protected]
with the subject unsubscribe, or go to http://www.lojban.org/lsg2/, or if
you're really stuck, send mail to [email protected] for help.

Posted by Anonymous on Thu 17 of Jul, 2008 11:55 GMT

Random sentences are quite OK. We ourselves recorded some sentences from
Alice, but send us whatever you have got; as long as we get a transcript of
what was uttered, it is totally sufficient.

I know that uncompressed audio files are quite big, but hey, it's only 16-bit
mono, and of course you can compress them using zip, 7z or whatever you like
;). I think then it should be no problem to send them via mail. Or you can
use some free file hosting on the web and send us the links. Just be
creative... If none of these methods is suitable, just send them
in a format (mp3, etc.) we can convert back into WAVs...

Thanks a lot for your help,
Nico

Posted by arj on Thu 17 of Jul, 2008 14:13 GMT posts: 953

On Wed, Jul 16, 2008 at 11:29:47PM +0200, Nico Möller wrote:

> It would
> be really cool if some of you could actually send us some audio data we can
> work with, if you do so please provide them in the following format:
>
> - 16bit mono, 16khz
> - preferable raw or wav data files
> - one sentence per audio file
> - a transcript text file containing one sentence per line + the name of the
> audio file in which the sentence was uttered

We have a few hours of recordings of spontaneous speech, together with transcriptions, here:
http://www.lojban.org/tiki/tiki-index.php?page=Story+Time+With+Uncle+Robin&bl

You "only" need some volunteers to sentence-align it. :-)

Out of curiosity:

Lojban claims to be "self-segregating", which means that if you know the phoneme string, and you know the stress pattern, you also know how to separate it into words. Will you be taking advantage of this in your model?

--
Arnt Richard Johansen http://arj.nvg.org/
"I had to translate this sentence into English because I could not read the
original Sanskrit." --Douglas Hofstadter

Posted by kupesid on Thu 17 of Jul, 2008 16:34 GMT posts: 5

Coi

I'm not sure if you are aware of http://jbobac.lojban.org/ which has
some examples with transcripts. A series of words (with transcript) can
be found at http://allalone.org/cizra/ which is intended as a
pronunciation guide.

mu'o mi'e .laxris

--

e'osai ko sarji la lojban.
http://lojban.org Please! Support Lojban.

Posted by cizra on Thu 17 of Jul, 2008 19:08 GMT posts: 9

> at http://allalone.org/cizra/ which is intended as a pronunciation guide.

Note that this contains English stuff, so it'd have to be cut apart,
which is tedious.

Posted by Anonymous on Thu 17 of Jul, 2008 21:15 GMT

It looks from the thread like you've got a bit of Lojban audio
available, but having more voices will probably benefit your project,
yes?

The attached zip file is full of me speaking in Lojban. Note that I
have a nonstandard-but-not-incorrect habit of pronouncing the Lojban
.y'y (apostrophe) /h/ as θ, which in my first batch of recordings I
tried to avoid, but I found myself stumbling over every word with a .y'y
in it. The sentences are an excerpt from lapoi pelxu ku'o trajynobli,
and a transcript is as follows:

(kydypa) ni'oda'o ko'a goi so'e le cpare na'o cuxne le frili pluta
mu'inaibo lu le dargu pe lo xamgu bangu cu kargu li'u ka'u

(kydyre) .i li'a le frili pluta co'u ranji gi'enai tcena le cmana jipno

(kydyci) .ije go'i ja'e lenu ko'a za'o litru lo cuksmi

(kydyvo) .ije ko'a co'anai djuno la'edi'u

(kydymu) .i ku'i ma'a na'o jifsruma ledu'u mintu fele ba'e zumcpare
fale broda cei pupyjmina be lo cpare cabra

(kydyxa) .i ba le mo'u broda ku ku su'a krici ledu'u ma'a ca klama
vuga'u le cmana salpo

Craig B. Daniel / .kreig.daniyl.

Posted by PierreAbbat on Sat 19 of Jul, 2008 03:14 GMT posts: 324

ftp://phma.optus.nu
If you need more, let me know! And should I record the song sung?

Pierre

Posted by Anonymous on Sun 27 of Jul, 2008 17:05 GMT

Nico Möller wrote:
> Unfortunately we discovered that there is very few (usable) lojban audio
> data on the web, but we actually need huge amounts of them to feed our
> training algorithms. It would be really cool if some of you could
> actually send us some audio data we can work with,

Instead of collecting random bits of audio, it occurs to me that the
community could devise a short sample corpus of Lojban text that could
then be recorded as spoken by a wide variety of different accents,
speech rhythms, mis-pronunciations, etc.

A good place to start would be a Lojban pangram [0], but an ideal
training set would include most/all legal two-letter combinations.
Would it be crazy to consider the shortest meaningful text that included
all cmavo and lujvo? Probably ...

[0] a short text containing every letter in the alphabet, e.g.
http://en.wikipedia.org/wiki/The_quick_brown_fox
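
To get a rough idea of how many two-letter combinations a candidate text covers, something like the following Python sketch could be used. It is only an illustration: the alphabet string (the Lojban letters plus the apostrophe) and the function name `covered_pairs` are assumptions made here, and the sketch does not know which pairs are actually legal in Lojban; it only collects the pairs that occur inside words.

```python
# Assumed alphabet: the Lojban letters plus the apostrophe (no h, q, w).
LOJBAN_SYMBOLS = "abcdefgijklmnoprstuvxyz'"


def covered_pairs(text):
    """Return the set of adjacent symbol pairs found inside the words of text."""
    pairs = set()
    for word in text.lower().split():
        # Drop anything outside the assumed alphabet (periods, commas, digits).
        symbols = [ch for ch in word if ch in LOJBAN_SYMBOLS]
        pairs.update(a + b for a, b in zip(symbols, symbols[1:]))
    return pairs
```

Comparing `len(covered_pairs(text))` across candidate texts gives a crude measure of which one exercises more combinations per sentence.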

-- Steve

Posted by spheniscine on Mon 28 of Jul, 2008 06:11 GMT posts: 10

A Lojban pangram: .o'i mu xagji sofybakni cu zvati le purdi (Watch out,
five hungry Soviet cows are in the garden)
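
A quick way to confirm that a sentence is a pangram is to compare its letters against the alphabet. The Python sketch below assumes the 23 Lojban letters listed in the code (apostrophe, period and comma are left out on purpose); the name `missing_letters` is made up for this example.

```python
# Assumed letter set: the Latin alphabet without h, q and w.
LOJBAN_LETTERS = set("abcdefgijklmnoprstuvxyz")


def missing_letters(text):
    """Return the sorted Lojban letters that do not occur in text."""
    return sorted(LOJBAN_LETTERS - set(text.lower()))


print(missing_letters(".o'i mu xagji sofybakni cu zvati le purdi"))  # prints []
```

An empty list means every letter is present, so the sentence above passes this check.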

