Letter Frequencies

I've just generated new letter frequency data based on all but the
first section of:

test_sentences.txt

So basically, the CLL, Alice, and a bunch of IRC. If people would
like to suggest other non-trivially sized Lojban texts to add,
please let me know, but we've got ~650K characters here, so I think
the statistics are pretty good.

My data, sorted by number of occurrences:

85004 i
68959 a
52225 e
50517 u
47944 o
43807 l
36358 n
33169 c
27097 m
24514 r
22989 s
21356 d
20536 '
18317 t
17749 k
14459 b
13359 p
11990 j
8810 g
8007 z
6857 v
6616 x
6288 f
4580 y


As ratios:

0.130472888242183 i
0.105845370809523 a
0.080160305261493 e
0.077538691065483 u
0.073589385839292 o
0.067239492438300 l
0.055806000549495 n
0.050911195121464 c
0.041591264560472 m
0.037626610305031 r
0.035285883344307 s
0.032779386867677 d
0.031520766469124 '
0.028114816878406 t
0.027242992016969 k
0.022193161393507 b
0.020504768175936 p
0.018403486071523 j
0.013522494769818 g
0.012289967720991 z
0.010524829357167 v
0.010154917752226 x
0.009651469592805 f
0.007029855396795 y
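
For anyone who wants to produce numbers like these from another text, here is a minimal Python sketch. It is not the script used to generate the data above. The function names, the sample sentence, and the choice to lowercase the text and count only the 23 Lojban letters plus the apostrophe are all assumptions made for illustration.

```python
from collections import Counter

# Assumption: tally the same 24 symbols as the tables above
# (23 letters plus the apostrophe) and ignore everything else.
LOJBAN_SYMBOLS = set("abcdefgijklmnoprstuvxyz'")

def letter_counts(text):
    """Count each Lojban symbol in text, skipping spaces, dots, digits, etc."""
    return Counter(ch for ch in text.lower() if ch in LOJBAN_SYMBOLS)

def letter_ratios(counts):
    """Divide each count by the total number of counted symbols."""
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

# Hypothetical sample input; a real run would read a whole corpus file.
sample = "coi rodo .i mi tavla do"
counts = letter_counts(sample)
ratios = letter_ratios(counts)

# Print in the same "count letter" order as the tables, most frequent first.
for ch, n in counts.most_common():
    print(f"{n} {ch} {ratios[ch]:.3f}")
```

Each ratio is simply the letter's count divided by the sum of all counts, which is how the second table relates to the first (for example, 85004 out of 651507 counted characters for i).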

The only previous work on this I'm aware of is:

The Scrabble Paper

Which, it turns out, is amazingly flawed (which is fine, because
that was a long time ago!).

Using the data without lujvo, we have:

i 1045
a 991
u 642
n 563
e 496
r 460
o 395
t 361
c 360
l 348
s 339
' 316
k 285
m 254
j 249
d 219
b 212
p 203
f 149
g 146
v 119
x 108
z 87
y 19

which is only marginally different from what I have.
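
Since the two lists differ in size by a couple of orders of magnitude, one way to compare them is to convert both to ratios and look at the per-letter differences. The sketch below is only an illustration of that idea: the counts are copied from the tables above, while the helper names to_ratios and ratio_differences are hypothetical.

```python
# Counts copied from the corpus table and the no-lujvo table above.
corpus = {"i": 85004, "a": 68959, "e": 52225, "u": 50517, "o": 47944, "l": 43807,
          "n": 36358, "c": 33169, "m": 27097, "r": 24514, "s": 22989, "d": 21356,
          "'": 20536, "t": 18317, "k": 17749, "b": 14459, "p": 13359, "j": 11990,
          "g": 8810, "z": 8007, "v": 6857, "x": 6616, "f": 6288, "y": 4580}
no_lujvo = {"i": 1045, "a": 991, "u": 642, "n": 563, "e": 496, "r": 460,
            "o": 395, "t": 361, "c": 360, "l": 348, "s": 339, "'": 316,
            "k": 285, "m": 254, "j": 249, "d": 219, "b": 212, "p": 203,
            "f": 149, "g": 146, "v": 119, "x": 108, "z": 87, "y": 19}

def to_ratios(counts):
    """Scale counts so they sum to 1, making lists of different sizes comparable."""
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def ratio_differences(a, b):
    """Per-letter difference in ratio (a minus b), largest magnitude first."""
    ra, rb = to_ratios(a), to_ratios(b)
    diffs = {ch: ra[ch] - rb.get(ch, 0.0) for ch in ra}
    return sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)

# Show the five letters whose share differs most between the two lists.
for ch, diff in ratio_differences(corpus, no_lujvo)[:5]:
    print(f"{ch} {diff:+.4f}")
```

A positive value means the letter takes a larger share in the corpus data than in the no-lujvo list, and a negative value the reverse.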

Using the data with lujvo, however, which IIRC is what the Scrabble
frequencies were based on, we have the obviously biased:

y 5553
r 2979
a 2949
i 2678
n 2047
u 1755
e 1560
l 1395
s 1363
t 1359
k 1107
m 1048
o 1046
c 1040
' 1012
j 1008
p 872
b 865
d 862
f 616
g 589
x 532
v 490
z 359

-Robin Lee Powell


Created by rlpowell. Last Modification: Sunday 05 of December, 2004 09:55:55 GMT by arj.