Gabriel Altmann, Fan Fengxiang (Editors)

Date post: 06-Jul-2018

    Analyses of Script

    Quantitative Linguistics 63

    Editors

    Reinhard Köhler
    Gabriel Altmann
    Peter Grzybek

    Mouton de Gruyter
    Berlin · New York

    Mouton de Gruyter (formerly Mouton, The Hague)
    is a Division of Walter de Gruyter GmbH & Co. KG, Berlin.

    Printed on acid-free paper which falls within the guidelines of the ANSI to ensure permanence and durability.

    Library of Congress Cataloging-in-Publication Data

    Analyses of script : properties of characters and writing systems / edited by Gabriel Altmann and Fan Fengxiang.
    p. cm. (Quantitative linguistics ; 63)
    Includes bibliographical references and index.
    ISBN 978-3-11-019641-2 (hardcover : alk. paper)
    1. Writing -- Mathematical models. I. Altmann, Gabriel. II. Fengxiang, Fan, 1950-
    P211.A555 2008
    411--dc22
    2008008072

    Bibliographic information published by the Deutsche Nationalbibliothek

    The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available in the Internet at http://dnb.d-nb.de.

    ISBN 978-3-11-019641-2
    ISSN 0179-3616

    Copyright 2008 by Walter de Gruyter GmbH & Co. KG, D-10785 Berlin.
    All rights reserved, including those of translation into foreign languages. No part of this book may be reproduced in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher.
    Cover design: Martin Zech, Bremen.
    Printing and binding: Hubert & Co. GmbH & Co. KG, Göttingen.
    Printed in Germany.

    Preface

    For a linguist, script is something that does not belong to language; it is something secondary, left to cultural scientists and historians. But for teachers of native languages, orthographers, cryptographers, paleographers, graphologists and especially pupils in elementary and grammar schools, it is an object of primary importance. For all these groups, script is something to be solved, to be used in making inferences about epochs, persons or contents, or to get good grades for. For computational linguists, it is a practical problem of mechanical conversion from written to spoken language or vice versa. Everybody uses script, but nobody cares about its inner life, in which there is perhaps some kind of self-regulation or control.

    A group of researchers, not believing in some older unsuccessful endeavours to find essential properties in script, but drawing on the way of thinking in quantitative and synergetic linguistics, started an experiment in conjecturing, quantifying and measuring the properties of script and seeking models of their behavior. This volume presents the results of their research. The results are surprising. Letters or other symbols have complexity, distinctivity, representativity, utility, grapheme size, phonemic load, ornamentality, uncertainty, dimension and perhaps a series of other properties which wait to be established. Some of the properties are associated with one another, some of them compete, and there probably is a control cycle which may become the basis of a future theory of script.

    The researchers considered only four script types, namely Latin, Oriya, Japanese and Old Egyptian, which is, of course, not enough to draw general conclusions, but at least a start has been made. Some common problems have been analysed using English, Italian, Swedish, Slovak, Slovenian and German. For Oriya a new weighted distributional calculus has been drafted using letter form and positioning; for the first time the strange and unexpected way of simplification of hieroglyphs has been expressed quantitatively, leaving open the question of measuring the change from iconism to symbolism; for Japanese the dependence of frequency on polytextuality of kanji has been modelled, and a look has been cast at the capacity dimension of signs. And last but not least, a first draft of a future theory of script has been ventured.

    For quantitative linguists, a theory is a set of interrelated hypotheses, of which at least one is a law. Though in the present volume laws have not been established (it is a very long way before a statement can be considered a law), a network of hypotheses has been set up and instructions for continuing this work have been offered in the last contribution. Needless to say, further development can fundamentally change the direction of research, and the results presented here may become only peripheral consequences of a deeper theory, a usual event in the evolution of science.

    There are, of course, also practical aspects of this investigation. The representation of phonemes, some load and utility problems etc. can show quantitatively whether an orthographic reform is necessary or whether it is too late to perform one. The resulting numbers must simply be adequately interpreted. The numbers are objective and have nothing to do with the speakers'/writers' intuition or national feeling.

    The results can be considered as quantitative descriptions of some writing problems of individual languages, but for the analysts involved they are an attempt at laying the foundations of a discipline that has not existed up to now, opening new vistas and embedding this object in the scope of synergetic linguistics.

    Finally, we would like to express our sincere gratitude to Peter Grzybek and Reinhard Köhler as the editors of the series Quantitative Linguistics, who have enthusiastically accompanied the whole process of this book coming into existence, from the first ideas to the preparation of the layout. In this respect, our thanks also go to Veronika Koch for her competent technical help.

    Gabriel Altmann (Lüdenscheid, Germany)
    Fan Fengxiang (Dalian, China)

    Contents

    Preface
      Gabriel Altmann and Fan Fengxiang

    I. Introduction

    Quantitative analysis of writing systems: an introduction
      Reinhard Köhler

    II. The phoneme-grapheme relation

    The phoneme-grapheme relationship in Italian
      Gerald Bernhard and Gabriel Altmann

    Graphemic representation of English phonemes
      Fan Fengxiang and Gabriel Altmann

    The phoneme-grapheme relationship in Slovene
      Emmerich Kelih

    On the distribution of graphemic representations
      Ján Mačutek

    The phoneme-grapheme relation in Slovak
      Emília Nemcová and Gabriel Altmann

    III. Special problems

    Script ornamentality
      Karl-Heinz Best and Gabriel Altmann

    On the decrease of complexity from hieroglyphs to hieratic symbols
      Ina Hegenbarth-Reichardt and Gabriel Altmann

    The fractal dimension of script: an experiment
      Reinhard Köhler

    On graphemic representation of the Oriya phonemes
      Panchanan Mohanty and Gabriel Altmann

    On the relation between types and tokens of Japanese morae
      Katsuo Tamaoka

    IV. Towards a theory

    Towards a theory of script
      Gabriel Altmann

    Authors' Addresses

    Author Index

    Subject Index

    I. Introduction

    Quantitative analysis of writing systems: an introduction

     Reinhard Köhler 

    1 Introduction

    The cultures of the world use various, quite different writing systems (scripts) to fix linguistic material. Linguists distinguish between two principally different categories: logographic (subdivided into pictographic, ideographic and abstract-logographic) and phonographic (subdivided into segmental, syllabic and alphabetic) scripts (cf. Table 1). Sometimes, logographic, syllabic and alphabetic principles occur in a mixture.

    Table 1: Categories of scripts with examples (the example glyphs did not survive extraction)

    Category         Subcategory
    logographic      pictographic
                     ideographic
                     abstract-logographic
    phonographic     sound-segmental
                     syllabic
                     alphabetic
    mixed systems    various kinds

    A recent, increasing interest in quantitative descriptions of graphical symbols and scripts can be observed in linguistics. The present contribution aims at giving an overview of measurable properties of signs and sign systems, as well as of functional dependences among symbol properties.

    2 Properties

    We should distinguish between properties of individual signs and properties of sign systems. Properties such as frequency of occurrence, complexity, phoneticity etc. can be attributed to individual signs, whereas scripts can be characterised in terms of inventory size, entropy, efficiency, learnability etc. Many properties seem to be determinable with respect to individual signs and to systems as well, such as ambiguity, distinctivity etc. However, in each of these cases we observe, in fact, two different features, i.e. these terms are ambiguous themselves. E.g., ambiguity can be measured with respect to different individual signs, whereas ambiguity with respect to scripts will probably be defined as the mean ambiguity of all the signs belonging to the given script, or in a similar way, thus reflecting some kind of a global property.

    Another distinction should be kept in mind, viz. the distinction between language-dependent and language-independent properties. Many scripts are in use for more than one language. Therefore, some of the properties of signs and even of scripts depend on their function in the given language. A simple example can be seen in the fact that the letter <z> of the Roman alphabet represents a single sound in English, viz. /z/, whereas, in German and Italian, it stands for two sounds: /ts/, in Italian also for /dz/. The letter <s> is unambiguous in Swedish; in German, its pronunciation is /s/ or /z/, depending on its position and context. The Roman alphabet, as used for the German language, has in Germany one letter more than it has in Switzerland, where the <ß> is replaced by <ss> in all cases.

    Before any attempt to find a promising measure of a property, one has to clarify how the corresponding units should be defined. The basic units of alphabetic scripts seem to be clear at first glance. However, different authors use different definitions. Some authors consider as the basic unit, the grapheme, any letter or combination of letters which represents a sound. The present author prefers the following definition:

    Definition 1
    A grapheme is any graphical sign which, on its own, represents in at least one context a portion of linguistic material. Hence, the letter <…> is a grapheme regardless of the fact that it appears also in sequence with <…> for another sound. On the other hand, diacritics such as accents would not be considered as graphemes but as parts of complex graphemes because they do not represent any sound, sound combination, word, or meaning. They are rather distinctive features which serve to differentiate graphemes. Sequences such as <…> will then be considered as syntagmas.

    However, another point of view may support Altmann's variant: sequences representing a single sound could also be considered as compound graphemes. Anyway, the appropriateness of any definition of a unit or a property depends crucially on the approach and the purpose of the given investigation and on its compatibility with other definitions within the given approach. Therefore, results of scientific studies can only be compared if all units and properties are explicitly defined (or, in advanced fields, are common among all researchers).

    The same is true of the measures which are employed to determine the properties under study. Let us consider, as an illustrative example, the complexity of individual signs. Bohn (2002) operationalised complexity in terms of the number of strokes a Chinese character consists of (cf. Figure 1a). This measure works perfectly for Chinese; for other scripts, however, the stroke inventory would have to be defined in a different way, if possible at all. Altmann (2004) avoids this difficulty. He proposes and uses a measure according to Figure 1b, assigning different scores to dots, straight lines and arches on the one hand and continuous, crisp, and crossing connections on the other hand. Another method is preferred by Peust (2006): he defines complexity in terms of the maximum number of intersections with a straight line. Figure 1c gives an illustration, which, at the same time, shows the limitations of this approach, since there is no position or angle of a straight line which would correspond to the intuitive complexity of the given symbol.

    (a) Bohn [illustration not reproduced]

    (b) Altmann

        Form            Points      Connection       Points
        dot             1           continuous       1
        straight line   2           crisp (sharp)    2
        arch            3           crossing         3

    (c) Peust [illustration not reproduced]

    Figure 1: Three different methods to determine 'complexity'
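    The scoring scheme in Figure 1b can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the point values come from the figure, but the function name `altmann_complexity` and the decompositions of "L" and "X" into forms and connections are hypothetical choices made here, not analyses taken from Altmann (2004).

    ```python
    # Point values as listed in Figure 1b.
    FORM_POINTS = {"dot": 1, "straight_line": 2, "arch": 3}
    CONNECTION_POINTS = {"continuous": 1, "crisp": 2, "crossing": 3}


    def altmann_complexity(forms, connections):
        """Add up the form scores and the connection scores of one sign."""
        form_score = sum(FORM_POINTS[f] for f in forms)
        connection_score = sum(CONNECTION_POINTS[c] for c in connections)
        return form_score + connection_score


    # Hypothetical decomposition of "L": two straight lines meeting in a sharp corner.
    letter_l = altmann_complexity(["straight_line", "straight_line"], ["crisp"])

    # Hypothetical decomposition of "X": two straight lines that cross each other.
    letter_x = altmann_complexity(["straight_line", "straight_line"], ["crossing"])

    print(letter_l, letter_x)
    ```

    Under these assumed decompositions the crossing in "X" makes it one point more complex than "L", which is the kind of distinction the connection scores are meant to capture.
    
    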

    Another important aspect should also be emphasised: giving a property a name, such as complexity, does not suffice to find an appropriate measure, of course. Before metricising (quantifying) or even measuring, the concept behind the term must be clarified; otherwise any definition or operationalisation will be unsatisfying or even useless and misleading. Our example, complexity, may serve to illustrate this aspect. A closer look at what can be understood under the term unveils that quite a number of different concepts may be connected with it, depending on the specific interest of the researcher and on the interrelations one has in mind. Let us consider only two perspectives on complexity, viz. complexity from the point of view of the writer, and complexity from the point of view of the reader. We shall call these two perspectives Production Complexity and Decoding Complexity, respectively. Additionally, we will consider the fact that both kinds of complexity can be measured with respect to different kinds of effort. Again, we shall take into account only two of them: Muscular/Nervous Effort and Cognitive Effort. Combinations of perspectives with efforts yield four different bases for operationalisation:

    1. Production Complexity in terms of Muscular/Nervous Effort
    2. Production Complexity in terms of Cognitive Effort
    3. Decoding Complexity in terms of Muscular/Nervous Effort
    4. Decoding Complexity in terms of Cognitive Effort.

    There are certainly more than two perspectives and also more than two forms of effort connected with them. Moreover, it is easy to find more than these two aspects, perspective and effort form, which should be taken into account when complexity is concerned [1]. And clearly, the specific selection of (a combination of) aspects determines the way a property can be measured. Let us follow up our example on complexity and discuss the possibilities to find operationalisations according to the different aspects.

    One possibility to measure cognitive effort of sign production is to determine the number of different elements needed for the given sign. Of course, a measuring procedure is to be preferred if it can be applied mechanically, or even better, automatically. This is possible by taking the number of trajectories a sign consists of. Computer fonts in vector representation describe for each symbol the Bézier curves which specify it. Evaluating a font in this way enables automatic processing of the corresponding data. Figure 2 illustrates the procedure.

    [1] There are other aspects, of course, which are not connected with the ones discussed so far. Peust's measure, for example, does not reflect any of these aspects but corresponds to a rather abstract, topological concept.

    Figure 2: Number of Bézier curves as specifications of the trajectories

    Although this procedure looks similar to Altmann's approach, there is a significant difference. With respect to the production effort, the existence of an intersection does not matter. Drawing two strokes such as “/\” causes the same effort as drawing the two intersecting strokes in the letter “X” in our case.
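    As a rough illustration of counting trajectories automatically, the sketch below counts the drawing segments in an outline given as SVG path data. The SVG input format, the function name `count_segments` and the two sample paths are assumptions made for this example; the chapter does not say which font format or tool was used, and real font outlines trace the contour of a glyph rather than pen strokes.

    ```python
    import re

    # Number of numeric arguments that make up one segment for each SVG path
    # command (lower-cased). "z" (close path) takes no arguments and is not counted.
    ARGS_PER_SEGMENT = {"l": 2, "h": 1, "v": 1, "c": 6, "s": 4, "q": 4, "t": 2, "a": 7}

    # A token is either a single command letter or a number.
    TOKEN = re.compile(r"[A-Za-z]|[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


    def count_segments(path_data):
        """Count line and curve segments in SVG path data.

        Simplification: arc flags written without separators are not handled.
        """
        tokens = TOKEN.findall(path_data)
        count = 0
        i = 0
        while i < len(tokens):
            command = tokens[i].lower()
            i += 1
            n_numbers = 0
            while i < len(tokens) and not tokens[i].isalpha():
                n_numbers += 1
                i += 1
            if command in ARGS_PER_SEGMENT:
                count += n_numbers // ARGS_PER_SEGMENT[command]
            elif command == "m":
                # Coordinate pairs after the first one are implicit line segments.
                count += max(n_numbers // 2 - 1, 0)
        return count


    # Two strokes meeting at the top, like "/\" (hypothetical coordinates).
    caret = count_segments("M0 10 L5 0 L10 10")
    # Two crossing strokes, like "X" (hypothetical coordinates).
    cross = count_segments("M0 0 L10 10 M10 0 L0 10")
    print(caret, cross)
    ```

    Both sample paths yield the same count, which mirrors the point made above: for this measure an intersection does not add anything.
    
    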

    However, if production complexity with respect to muscular and/or nervous effort is to be measured, we have to take into account the way the writing instrument is used. Different instruments will require a different number of movements for the same sign. Thus, drawing an “R” with a pencil requires additional movements if the vertical line is drawn from top to bottom. Then the pencil has to be lifted, moved back to the starting position, and lowered again. In any case, an “X” requires an extra movement (cf. Figure 3). Clearly, the situation is different if signs are produced with hammer and chisel, and again different when a typewriter or computer is used. Furthermore, the size of the signs cannot be ignored. Moreover, to have a more realistic picture of production effort, the length and angles of the curves should be taken into account, and one should not forget that the effort connected with drawing a concave curve (with the right hand), for example, is less than that of drawing a straight line. Changes of movement direction also cause effort, etc. These considerations show that properties as concepts cannot be taken for granted. The more realistic the measure of effort, the more doubts arise as to whether effort is an appropriate operationalisation of complexity at all. If not, what else is complexity? Or ornamentality?

    Analogously, decoding (or recognition) effort can be measured in several ways. One of them is the measurement of the time a person needs, ceteris paribus, to recognise a given sign, a rather impractical method. Another one is to analyse the signs with respect to their distinctive elements. However, [...]

    [...] is under study, the corresponding mathematical model will be a frequency distribution. Every new scientific field begins with individual properties, studying their distributions and their dependences or effects on other individual properties. In the early years of quantitative linguistics, the study of word frequencies was predominant. Later, other properties of words were detected, such as length and polysemy, and their interrelations with frequency were studied pairwise. The simultaneous investigation of more than two properties of words, i.e. multi-dimensional studies, is a recent innovation. Today, we have the means to model rather complex networks of properties and their interrelations, including the theory of dynamic behaviour, thanks to systems theory in general and synergetics in particular. Synergetic models of linguistic phenomena enable us to set up complex models with explanatory power (on the basis of functional explanation, cf. Köhler 1986, 2005).

    The present volume gives examples of measures, distributions, and functions concerning properties of scripts and signs, and introduces a first attempt at a synergetic model of a complex network of script properties.

    References

    Altmann, Gabriel
    2004 “Script complexity.” In: Glottometrics, 8; 68–73.

    Bohn, Hartmut
    2002 “Untersuchungen zur chinesischen Sprache und Schrift.” In: Köhler, Reinhard (Ed.), Korpuslinguistische Untersuchungen zur quantitativen und systemtheoretischen Linguistik; 127–177. [http://ubt.opus.hbz-nrw.de/volltexte/2004/279]

    Köhler, Reinhard; Altmann, Gabriel
    1983 “Systemtheorie und Semiotik.” In: Zeitschrift für Semiotik, 5(4); 424–431.

    Köhler, Reinhard
    1986 Zur linguistischen Synergetik. Struktur und Dynamik der Lexik. Bochum: Brockmeyer.
    2005 “Synergetic linguistics.” In: Köhler, Reinhard; Altmann, Gabriel; Piotrowski, Rajmund G. (Eds.), Quantitative Linguistics. An International Handbook. Berlin / New York: Mouton de Gruyter, 760–774.

    Peust, Carsten
    2006 “Script complexity revisited.” In: Glottometrics, 12; 11–15.

    II. The phoneme-grapheme relation

    The phoneme-grapheme relationship in Italian

    Gerald Bernhard and Gabriel Altmann

    1 Introduction

    The graphemic representation of phonemes in a language depends to a considerable extent on its history, on the type of script (letters, syllables, moras, ideograms, mixed scripts), on the strength of foreign influence, on the time of the introduction of the script, on the number of writing reforms, on the unity or contiguity of the area where the language is spoken (cf. English vs. Danish), etc. The graphemic representation can speed up or slow down communication; it is important for the teaching of the native language and the second language. The representation has different properties which become manifest only after having been quantified.

    In the present contribution we shall only examine those properties that have been defined by Best and Altmann (2005), namely (1) the orthographic uncertainty of a phoneme, (2) the distribution of grapheme size, (3) the graphemic exploitation of letters and (4) the positional participation of letters in graphemes. Since Italian uses a letter script, a direct comparison of results with German and Swedish is possible. Letters are the symbols of the Latin alphabet; graphemes also include their combinations and letters with diacritical marks (cf. „e“ 'and' vs. „è“ 'is'), which can also mark the position of the accent (cf. „meta“ vs. „metà“).

    Italian took over the Latin alphabet and adapted it for its own purposes. The representation of individual phonemes by graphemes is shown in Tables 1a and 1b. Long vowels and long consonants are considered separate phonemes; some consonants obtain the status of allophones because they occur in complementary distribution.

    Table 1a: The phoneme-grapheme relation in Italian: vowels (* = borrowings; the Graphemes column did not survive extraction)

    Phoneme   Examples
    /i/       chicco; hippie/hippy*; così
    /i:/      vino
    /e/       pesce; hegeliano*; perché; eh!
    /e:/      pelo
    /ɛ/       pesca; herpes*
    /ɛ:/      bene; bebè
    /a/       bacca; hanno; ah!
    /a:/      baco; città
    /ɔ/       cogliere; ho; yacht [jɔte]
    /ɔ:/      toro; oblò
    /o/       torre; holding*; boh!
    /o:/      volo, boh!
    /u/       burro; humus*; uh!
    /u:/      luna; gioventù
    /ao/      ciao, Paolo
    /au/      Sabaudia
    /ai/      zaino
    /ɛi/      eidetico
    /ɛu/      Europa

    2 The orthographic uncertainty of phonemes

    As can be seen in Table 1, individual phonemes are represented by different numbers of graphemes. In the ideal case each grapheme should correspond to only one phoneme, but since script is either inherited and does not follow the development of language, or was taken over, with the number of phonemes in the target language being originally greater than in the source language, discrepancies arise automatically. The number of graphemes must be enlarged, a process carried out by combining letters or adding new symbols. Different phonological processes result automatically in multiple representations of individual phonemes. Hence these processes give rise to uncertainties of orthographic representation, which can, however, be expressed numerically. If the mean uncertainty surpasses a certain threshold, it is a signal for a writing reform.

    The orthographic uncertainty of a phoneme /x/ can be expressed as

    $$U_{/x/} = \log_2 n_x \qquad (1)$$

    where $U_{/x/}$ is the uncertainty of the phoneme /x/, $\log_2$ is the logarithm to base 2, and $n_x$ is the number of graphemes that can represent the phoneme /x/. Table 2 lists all Italian phonemes and their uncertainties $U_x$, with $n_x$ denoting the number of representing graphemes and $f_x$ the number of phonemes with uncertainty $U_x$.

    Table 1b: The phoneme-grapheme relation in Italian: consonants (* = borrowings; the Graphemes column did not survive extraction)

    Phoneme          Examples
    /r/              toro
    /r:/             torre
    /l/              male
    /l:/             palla
    /ʎ/ [ʎ, ʎʎ]      gli [ʎʎ]; figlio, maglia, aglio
    /m/              lama
    /m:/             mamma
    /n/              nano
    /n:/             nanna
    /ɲ/ [ɲ, ɲ:]      gnomo; stagno
    /f/              afa
    /f:/             baffi
    /v/              lava
    /v:/             davvero
    /s/              casa, extra
    /s:/             cassa
    /z/              rosa, sbaglio
    /ʃ/ [ʃ, ʃ:]      scemo; pesce, sciame, lasciare
    /ts/             zio
    /dz/             zanzara
    /t:s/            cozza
    /d:z/            razzo; mazurca
    /tʃ/             cena; pace; bacio
    /t:ʃ/            lacci (pl.), laccio
    /dž/             gerla, regione; jazz
    /d:ž/            raggio, laggiù
    /w/              quello, guaio
    /j/              chiave, piano; yogurt*; Juventus*, juventino
    /p/              papa
    /p:/             pappa
    /t/              seta
    /t:/             setta
    /k/              cane; chino; questo; extra; karate*; kit*
    /k:/             becco; becchi; acqua
    /b/              bibita
    /b:/             babbo
    /d/              guado
    /d:/             freddo
    /g/              lago; laghi (pl.); hegeliano*
    /g:/             leggo (I read); tegghia

    Table 2: Orthographic uncertainty of Italian phonemes

    n_x   U_x    f_x   Phonemes
    1     0      34    /i:/, /e:/, /ao/, /au/, /ai/, /ɛi/, /ɛu/, /r/, /r:/, /l/, /l:/, /m/, /m:/, /n/, /n:/, /ɲ/, /f/, /f:/, /v/, /v:/, /s:/, /z/, /ts/, /dz/, /t:s/, /w/, /p/, /p:/, /t/, /t:/, /b/, /b:/, /d/, /d:/
    2     1      14    /ɛ/, /ɛ:/, /a:/, /ɔ:/, /o:/, /u:/, /ʎ/, /s/, /ʃ/, /d:z/, /tʃ/, /t:ʃ/, /d:ž/, /g:/
    3     1.58   8     /a/, /ɔ/, /o/, /u/, /dž/, /j/, /k:/, /g/
    4     2      1     /e/
    5     2.32   2     /i/, /k/

    The mean uncertainty can be computed as the average by means of

    $$\bar{U} = \frac{1}{N} \sum_{x \in I} f_x U_x \qquad (2)$$

    where $N$ is the number of all phonemes (here $N = 59$, the sum of the $f_x$ in Table 2). In our case

    $$\bar{U} = [34(0) + 14(1) + 8(1.58) + 1(2) + 2(2.32)]/59 = 0.5641.$$
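    The computation in (1) and (2) can be checked with a short Python sketch. The frequencies are those of Table 2; the names `ITALIAN`, `uncertainty` and `mean_uncertainty` are illustrative choices made here. Because the sketch uses unrounded logarithms instead of the rounded values 1.58 and 2.32, its result (about 0.565) is marginally higher than the printed 0.5641.

    ```python
    import math

    # n_x -> f_x from Table 2: how many phonemes are represented by n_x graphemes.
    ITALIAN = {1: 34, 2: 14, 3: 8, 4: 1, 5: 2}


    def uncertainty(n_graphemes):
        """Orthographic uncertainty of one phoneme, formula (1)."""
        return math.log2(n_graphemes)


    def mean_uncertainty(freq):
        """Mean uncertainty over all phonemes, formula (2)."""
        n_phonemes = sum(freq.values())
        return sum(f * uncertainty(n) for n, f in freq.items()) / n_phonemes


    u_bar = mean_uncertainty(ITALIAN)
    print(round(u_bar, 4))
    ```
    
    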

    Comparing this number with the result from Swedish, $\bar{U} = 0.797$, and with German, $\bar{U} = 0.965$ (cf. Best & Altmann 2005), one could conclude that the Italian orthography is not as vague as the German or Swedish. In order to get a more objective image of these differences we set up an asymptotic test for the difference of two mean uncertainties, i.e.

    $$z = \frac{\bar{U}_1 - \bar{U}_2}{\sqrt{V(\bar{U}_1) + V(\bar{U}_2)}} \qquad (3)$$

    Here $\bar{U}$ is the empirical mean uncertainty, $V(\bar{U})$ is its variance and $z$ is the quantile of the normal distribution. The variance of $\bar{U}$ can be derived using the Taylor expansion as below:

    $$V(\bar{U}) = V\!\left(\frac{1}{N} \sum_{x=1}^{N} \log_2 n_x\right) = \frac{1}{N^2} \sum V(\log_2 n_x) = \frac{1}{N^2 \ln^2 2} \sum \left.\left(\frac{1}{n_x}\right)^{2}\right|_{n_x = E(n_x)} V(n_x)$$
    Since $n_x$ is the original variable whose expectation is $E(n_x) = \mu$ and whose variance is $V(n_x) = \sigma^2$, we obtain, after substituting in the above formula,

    $$V(\bar{U}) = \frac{\sigma^2}{N \mu^2 \ln^2 2} \qquad (4)$$

    which can be estimated by means of empirical values as

    $$V(\bar{U}) = \frac{s^2}{0.48\, N\, \bar{x}^2} \qquad (5)$$

    In (5) one can easily see that, apart from the sample size $N$, the variance of the uncertainty depends on the data only through the well-known coefficient of variation $s/\bar{x}$. For Italian we obtain

    $$\bar{x} = \frac{1}{N} \sum x f_x = [1(34) + 2(14) + 3(8) + 4(1) + 5(2)]/59 = 1.694915$$

    $$s^2 = \frac{1}{N} \sum (x - \bar{x})^2 f_x = \frac{1}{N} \sum x^2 f_x - \bar{x}^2 = [1^2(34) + 2^2(14) + 3^2(8) + 4^2(1) + 5^2(2)]/59 - 1.694915^2 = 0.991669.$$

    Finally,

    $$V(\bar{U})_{\mathrm{Ital}} = 0.991669/[0.48\,(59)\,(1.694915^2)] = 0.012189.$$

    In the same way we obtain the variance for German as $V(\bar{U})_{\mathrm{German}} = 0.012602$ and for Swedish as $V(\bar{U})_{\mathrm{Swed}} = 0.022763$. If we perform the above test on Italian and German, we obtain

    $$z = \frac{0.9650 - 0.5641}{\sqrt{0.012189 + 0.012602}} = 2.55$$

    and this value is significant, i.e. Italian has a significantly smaller orthographic uncertainty than German. For the difference between Italian and Swedish we obtain

    $$z = \frac{0.7970 - 0.5641}{\sqrt{0.012189 + 0.022763}} = 1.25$$

    which is not significant, i.e. Swedish and Italian have roughly the same orthographic uncertainty.
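    The variance estimate (5) and the test statistic (3) can be reproduced with the following Python sketch. The Italian frequencies come from Table 2 and the German mean and variance are the values quoted in the text; the function names are illustrative. One assumption differs from the printed calculation: the sketch uses the exact value of ln² 2 (about 0.4805) rather than the rounded 0.48, so the Italian variance comes out slightly smaller than 0.012189.

    ```python
    import math

    # n_x -> f_x for Italian (Table 2).
    ITALIAN = {1: 34, 2: 14, 3: 8, 4: 1, 5: 2}


    def variance_of_mean_uncertainty(freq):
        """Estimate V(U-bar) as s^2 / (ln^2(2) * N * x-bar^2), cf. formula (5)."""
        n = sum(freq.values())
        mean = sum(x * f for x, f in freq.items()) / n
        s2 = sum(x * x * f for x, f in freq.items()) / n - mean ** 2
        return s2 / (math.log(2) ** 2 * n * mean ** 2)


    def z_statistic(u1, v1, u2, v2):
        """Asymptotic test for the difference of two mean uncertainties, formula (3)."""
        return (u1 - u2) / math.sqrt(v1 + v2)


    v_italian = variance_of_mean_uncertainty(ITALIAN)

    # Means and the German variance as quoted in the text.
    z_german_italian = z_statistic(0.9650, 0.012602, 0.5641, v_italian)

    print(round(v_italian, 6), round(z_german_italian, 2))
    ```

    Compared with the critical value 1.96 of a two-sided test at the 5% level, the German-Italian difference remains significant under this slightly different rounding.
    
    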

    3 The distribution of graphemic representations

    If a language using the Latin alphabet has fewer phonemes than there are letters in Latin, it can represent each phoneme by one letter. In such a case all frequencies are concentrated in the point $x = 1$, and we speak of a deterministic distribution. But if a language has more phonemes than the Latin letter inventory, it must reach for different means in order to build corresponding graphemes. One method is introducing marks placed over or under the letters, as in Slavic languages; another is using some letters to signal a special quality, such as <…> in German for prolonging the vowel; still another is combining or doubling some letters, e.g. <…>, <…>, or even <…> in several languages. These new forms can, however, be chosen in such a way that each phoneme is represented by one unique grapheme. This ideal state is usually considerably disturbed by interference with morphology or by disregard of the phonological development of the language. In this way phonemes acquire multiple representations. From the statistical point of view, a distribution of representation sizes of phonemes arises, and it can be captured formally.

    Since up to now only a small number of languages have been processed in this way, we can start from simple assumptions. At first, we assume that the representation size decreases geometrically, i.e. it follows a distribution of the form

    $$P_x = p\,q^{x-1}, \quad x = 1, 2, 3, \ldots \qquad (6)$$

    This is the 1-displaced geometric distribution. For Italian and Swedish this hypothesis would be adequate. However, in German (see Table 3) the distribution does not decrease monotonically but has its mode at $x = 2$, i.e. more phonemes are represented by means of two graphemes than by one grapheme.

    Table 3: Distribution of representation size of phonemes in three languages

    x    Italian   German   Swedish
    1    34        10       16
    2    14        18       10
    3    8         7        6
    4    1         3        1
    5    2         0        2
    6    −         1        1

    This circumstance can have different causes which must be analysed in-dividually. In order to keep the original hypothesis, we modify (6) by means

    of the Gram-Charlier expansion (see Shenton & Skees 1970; Mačutek, thisvolume, pp. 75ff.) and obtain

    $$P_x = p\,q^{x-1}\left[1 + a\left(x - \frac{1}{p}\right)\right], \quad x = 1, 2, 3, \ldots \qquad (7)$$

    where q = 1 − p, 0 < p ≤ 1 and 0 ≤ a ≤ 1/q − 1. This distribution is called either the Gram-Charlier-geometric or the Shenton-Skees-geometric distribution (cf. Wimmer & Altmann 1999). If p = 1 in (7), we obtain the deterministic distribution representing the ideal case, and if a = 0, we obtain the original geometric distribution.

    The fitting of (7) to the data in these languages can be seen in Table 4. Evidently the fit is in each case very satisfactory, but the hypothesis cannot be corroborated better until more languages have been examined. The fit is shown graphically in Figures 1a–1c.

    Table 4: Fitting the distribution (7) to data in Table 3

    x     Italian   German   Swedish
    1     33.31     9.99     15.79
    2     14.92     18.00    9.99
    3     6.37      7.54     5.35
    4     2.64      2.47     2.64
    5     1.76      0.73     1.24
    6     −         0.27     1.00
    p     0.6488    0.7768   0.6152
    a     0.2398    2.3323   0.4588
    DF    2         1        2
    χ²    1.55      0.12     1.36
    P     0.46      0.73     0.51
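    The expected frequencies in Table 4 can be recomputed from formula (7) and the fitted parameters. The following is a minimal sketch, not the authors' fitting procedure: the function and variable names are hypothetical, the sample sizes are the column totals of Table 3, and the last class is assumed to pool the tail so that the expected values sum to N.

```python
def gram_charlier_geometric(p, a, x_max):
    """Probabilities P_1..P_{x_max} of distribution (7).

    Assumption: the last class pools the tail, i.e. it holds P(X >= x_max).
    """
    q = 1.0 - p
    probs = [p * q ** (x - 1) * (1.0 + a * (x - 1.0 / p)) for x in range(1, x_max)]
    probs.append(1.0 - sum(probs))  # pooled tail
    return probs

# (p, a, N, number of classes); p and a from Table 4, N from the totals of Table 3
FITS = {
    "Italian": (0.6488, 0.2398, 59, 5),
    "German": (0.7768, 2.3323, 39, 6),
    "Swedish": (0.6152, 0.4588, 36, 6),
}

def expected_frequencies(language):
    """Expected frequencies N * P_x for one of the three languages."""
    p, a, n, classes = FITS[language]
    return [n * prob for prob in gram_charlier_geometric(p, a, classes)]

for language in FITS:
    print(language, [round(value, 2) for value in expected_frequencies(language)])
```

    The printed values can be compared row by row with the three columns of Table 4.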

    [Figure 1: Fitting (7) to Italian, German and Swedish data. Panels: (a) Italian, (b) German, (c) Swedish. Horizontal axis: x; plotted series: observed f(x) and expected NP(x).]

    4 Grapheme size

    A grapheme can consist of one or more Latin letters. Because the Latin letter inventory and the phoneme inventories of the target languages differ in size, new letters (as in German) and several additional marks (tilde, accents, etc.) were introduced. Thus grapheme size can be measured in two ways: (i) as the number of Latin letters without considering additional marks, (ii) as the number of Latin letters plus additional marks. Consequently a German grapheme that carries an additional mark consists of one symbol according to method (i) and of two symbols according to method (ii). For Italian we obtain the results on the basis of Table 1 as shown in Tables 5a and 5b.

    Table 5a: Size of Italian graphemes: method (i)

    Size   Number of graphemes
    1      30
    2      36
    3      5

    Table 5b: Size of Italian graphemes: method (ii)

    Size   Number of graphemes
    1      24
    2      42
    3      5

    In Table 5b six graphemes with an accent have passed from size 1 to size 2. The variable "size" has too small a support to allow a testable model to be set up. For the time being it is enough to characterize the graphemics by its average and to compare it with other languages. Using method (i) we obtain a mean size of 1.65, lying between German (1.68) and Swedish (1.61), while method (ii) yields a mean size of 1.70, lying also between German (1.78) and Swedish (1.67).

    5 The graphemic load of letters

    Latin letters are used with different frequencies in the graphemes of target languages. The exploitation of letters for building graphemes can be designated as graphemic load. One can ask whether the letters present in graphemes have something to do with phonemic relevance or whether they are merely historical relics. In German, the letter ⟨h⟩ occurs in 16 graphemes and its function is segmental (there is a phoneme /h/), purely combinatorial (as a component of multi-letter graphemes) or suprasegmental, e.g. to prolong the preceding vowel. In Italian ⟨h⟩ occurs in 13 graphemes, but it plays only a secondary role: either it occurs in historically petrified forms or it helps to maintain the phonetic value of the preceding consonant. In Table 6 the letters are ordered according to their graphemic load.

    It is not yet possible to set up hypotheses about this distribution because the empirical background is still very restricted and the class occupation very small. For the time being we must content ourselves with the computation of the mean load, which results from the numbers in Table 6 as 98/25 = 3.92. For German we get 3.96, for Swedish 3.36. Italian lies between them.
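    The mean load quoted above follows directly from the rows of Table 6. A small sketch of the arithmetic (the dictionary layout and variable names are illustrative only):

```python
# Table 6: graphemic load x (number of graphemes a letter occurs in) -> letters with that load
LOAD_TABLE = {
    1: "yxjk", 2: "rmfvzptqbd", 3: "n", 4: "ls", 5: "o",
    6: "u", 7: "eg", 8: "ac", 9: "i", 13: "h",
}

total_load = sum(load * len(letters) for load, letters in LOAD_TABLE.items())
letter_count = sum(len(letters) for letters in LOAD_TABLE.values())
mean_load = total_load / letter_count

print(total_load, letter_count, round(mean_load, 2))  # 98 25 3.92
```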

    Table 6: Graphemic load of Italian letters (participation in grapheme forming)

    x (component in x graphemes)   Letter                          Number of letters
    1                              y, x, j, k                      4
    2                              r, m, f, v, z, p, t, q, b, d    10
    3                              n                               1
    4                              l, s                            2
    5                              o                               1
    6                              u                               1
    7                              e, g                            2
    8                              a, c                            2
    9                              i                               1
    13                             h                               1

    6 Letter usefulness

    The participation of letters in building graphemes can be weighted. We assume that the role of a letter is the more peripheral the later it appears in the grapheme. The historical and morphological roles of letters are neglected in this case. The weighting has a purely positional character. The smaller the weight, the more useful the letter graphemically.

    Let us consider as an example the letter ⟨g⟩ and the graphemes in which it occurs (see Tables 5a and 5b). Let p_x g_i be the product of the position (p_x) of the letter ⟨g⟩ and the number of graphemes g_i in which it occurs in this position. Then the positional participation of a letter can be defined as

    $$PP = \sum_{g_i \in G} p_x\, g_i. \qquad (8)$$

    If the letter is weighted in each position by the position itself, then we find position 1 eight times and position 2 twice, i.e.

    $$PP = 1(8) + 2(2) = 12.$$

    If this operation is performed for each letter, one obtains the results for Italian in Table 7, PP denoting the weight and f_x the number of letters.
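    Formula (8) amounts to a position-weighted count. The sketch below uses a hypothetical helper that takes, for one letter, the number of graphemes in which it stands in each position; the input reproduces the worked example from the text.

```python
def positional_participation(position_counts):
    """PP as in (8): sum over positions of position * number of graphemes."""
    return sum(position * count for position, count in position_counts.items())

# Worked example from the text: position 1 eight times, position 2 twice
print(positional_participation({1: 8, 2: 2}))  # 12
```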

    There is a possible correlation between the relative frequency of individual letters and their graphemic usefulness. For the time being we can merely compute the mean positional weight of letters in the form

    $$PW(\mathrm{Language}) = \frac{1}{L}\sum_{x} f_x\, PP \qquad (9)$$

    where L is the size of the letter inventory. For Italian we obtain

    $$PW(\mathrm{Italian}) = [1(4) + 3(1) + 4(9) + \ldots + 22(1)]/25 = 6.48. \qquad (10)$$


    Table 7: Positional participation of letters in graphemes

    PP    Letter                       f_x
    1     y, x, j, k                   4
    3     q                            1
    4     r, m, f, v, z, p, t, b, d    9
    6     n, s                         2
    7     l, o                         2
    8     e, a                         2
    9     u                            1
    12    g                            1
    16    c                            1
    18    i                            1
    22    h                            1
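    The mean positional weight (9) for Italian can be checked against the rows of Table 7. A brief sketch with illustrative names:

```python
# Table 7: positional participation PP -> letters with that value
PP_TABLE = {
    1: "yxjk", 3: "q", 4: "rmfvzptbd", 6: "ns", 7: "lo", 8: "ea",
    9: "u", 12: "g", 16: "c", 18: "i", 22: "h",
}

inventory_size = sum(len(letters) for letters in PP_TABLE.values())        # L in (9)
weighted_sum = sum(pp * len(letters) for pp, letters in PP_TABLE.items())  # sum of f_x * PP
mean_positional_weight = weighted_sum / inventory_size

print(inventory_size, weighted_sum, round(mean_positional_weight, 2))  # 25 162 6.48
```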

    Comparing with Swedish (5.41) and German (6.12) we see that Italian has a strong letter usefulness (great positional weight). However, it will not be possible to examine historical and morphological dependencies before many languages have been analysed. The same holds for the comparison of individual letters in languages using Latin script and the relationship with the letter/grapheme frequency of occurrence.

    References

    Best, Karl-Heinz; Altmann, Gabriel
    2005 "Some properties of graphemic systems". In: Glottometrics, 9; 29–39.

    Mačutek, Ján
    2006 "On the distribution of graphemic representations". This volume, pp. 75–78.

    Shenton, Leanne R.; Skees, P.
    1970 "Some statistical aspects of amounts and duration of rainfall". In: Patil, Ganapati P. (Ed.), Random Counts in Scientific Work. University Park: The Pennsylvania State University, 73–94.

    Wimmer, Gejza; Altmann, Gabriel
    1999 Thesaurus of univariate discrete probability distributions. Essen: Stamm.


    Graphemic representation of English phonemes

    Fan Fengxiang and Gabriel Altmann

    1 Introduction

    The grapheme-phoneme analysis of English is radically different from the cases analyzed hitherto (German, Swedish, Italian, Slovak). This is caused by (i) the historical origins of English, (ii) its many national and regional varieties and (iii) the borrowing of many foreign words. The grapheme-phoneme mapping has been examined from different aspects (cf. Adams 1990; Berndt, Reggia, Mitchum 1987; Cunningham & Cunningham 1992; Fry 2004; Hanna et al. 1966; Patterson & Morton 1985; Seidenberg et al. 1984), and some probabilities have been computed. Such analysis is relevant not only for linguistics but also for cognition studies and pedagogy. We are interested here only in some measurable properties of the English phoneme-grapheme correspondence, in order to be able to study later on the divergence or convergence of this representation. Our analysis focuses on American English, and the results do not necessarily hold for other varieties, e.g. British English, though the methods used here can be applied directly. In order to work with controllable data, we adhere to the phonemic/graphemic analysis based on the American Carnegie Mellon Pronouncing Dictionary, hereafter referred to as cmudict¹, which has 129 425 entries with phonological transcriptions. The phonological symbols used in the dictionary are listed in Table 1. On typographical grounds we adhere to this way of symbolizing phonemes.

    The dictionary uses 39 consonant and vowel phonemes. In his Essential Introductory Linguistics, Hudson (2000: 24ff.) uses 38 phonemes without the vowel er, which is used in the cmudict, as well as in the World Book Dictionary (Barnhart & Barnhart 1979).

    The words analyzed are from the 1 000 000-word Brown Corpus, which has 42 436 word types, not counting Arabic numerals. Of these words, the cmudict covers 31 591; the uncovered part mostly consists of personal and place names and non-word strings. The phonemic/graphemic analysis presented here is the analysis of these 31 591 word types using the pronunciation given by the cmudict. The pronunciation of each word in the cmudict has the following form (0, 1 and 2 represent word stresses):

    LABORATORY L AE1 B R AH0 T AO2 R IY0

    1. ftp://ftp.cs.cmu.edu/afs/cs.cmu.edu/data/anonftp/project/fgdata/dict/

    Table 1: List of phonemes

    Phoneme   Example   Transcription
    /AA/      odd       AA D
    /AE/      at        AE T
    /AH/      hut       HH AH T
    /AO/      ought     AO T
    /AW/      cow       K AW
    /AY/      hide      HH AY D
    /B/       be        B IY
    /CH/      cheese    CH IY Z
    /D/       dee       D IY
    /DH/      thee      DH IY
    /EH/      Ed        EH D
    /ER/      hurt      HH ER T
    /EY/      ate       EY T
    /F/       fee       F IY
    /G/       green     G R IY N
    /HH/      he        HH IY
    /IH/      it        IH T
    /IY/      eat       IY T
    /JH/      gee       JH IY
    /K/       key       K IY
    /L/       lee       L IY
    /M/       me        M IY
    /N/       knee      N IY
    /NG/      ping      P IH NG
    /OW/      oat       OW T
    /OY/      toy       T OY
    /P/       pee       P IY
    /R/       read      R IY D
    /S/       sea       S IY
    /SH/      she       SH IY
    /T/       tea       T IY
    /TH/      theta     TH EY T AH
    /UH/      hood      HH UH D
    /UW/      two       T UW
    /V/       vee       V IY
    /W/       we        W IY
    /Y/       yield     Y IY L D
    /Z/       zee       Z IY
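    A dictionary line of this form is easy to split into the word, its phonemes and the stress digits. The following is a minimal sketch with a hypothetical helper name; it assumes whitespace-separated fields and that only vowels carry a trailing stress digit, as in the LABORATORY example.

```python
def parse_cmudict_line(line):
    """Split one dictionary line into the word and a list of (phoneme, stress) pairs."""
    word, *symbols = line.split()
    phonemes = []
    for symbol in symbols:
        if symbol[-1].isdigit():           # vowels carry a stress digit 0, 1 or 2
            phonemes.append((symbol[:-1], int(symbol[-1])))
        else:                              # consonants carry no stress mark
            phonemes.append((symbol, None))
    return word, phonemes

word, phonemes = parse_cmudict_line("LABORATORY L AE1 B R AH0 T AO2 R IY0")
print(word, phonemes)
```

    Dropping the stress digits leaves the bare phoneme sequence in the notation of Table 1.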

    2 Data

    The 31 591 word types from the Brown Corpus were automatically separated into graphemes and then paired by computer with their corresponding phonemes in the following form:

    i|n|au|g|u||r|a|t|io|n|, /ih n ao g y ah r ey sh ah n/, i:ih|n:n|au:ao|g:g|u:y ah|r:r|a:ey|t:sh|io:ah|n:n|

    Computerized analysis is error-prone, even with the best commercial state-of-the-art software, and this analysis is no exception. Although the result was manually checked, there may still be errors. In addition, there are indeterminable cases. For example, the first ⟨o⟩ in the word LABORATORY is not pronounced in American English; should it be paired with the letter ⟨b⟩ and the phoneme /b/ or with the letter ⟨r⟩ and the phoneme /r/? The ⟨o⟩ was finally put together with ⟨b⟩ to become ⟨bo⟩, or BO:B, meaning that the grapheme ⟨bo⟩ in this word is pronounced as /b/. Another possibility would be to consider ⟨o⟩ as representing nothing, but in that case the analysis would be quite different. The third possibility would be to consider the grapheme ⟨bor⟩ as representing the cluster /br/, but in that case the analysis would produce an enormous number of phonemes, clusters and graphemic representations. We chose the first alternative, which yielded a reasonable image of this kind of English.

    On the other hand, we could not avoid the fact that some single graphemes represent a group of phonemes: for example, ⟨u⟩ in the word COMPUTER represents the group of phonemes /y uw/, and ⟨x⟩ in BOX represents /k s/. In such cases the given phonemes are (implicit) parts of the graphemic representation, and these cases are marked with ∈, e.g. /k/ → ∈⟨x⟩.

    Another problem was the representation of a phoneme by a zero grapheme. For example, ABLER has the pronunciation /ey b ah l er/, in which /ah/ is present phonemically but not graphemically. Such cases are interpreted as /ah/ being part of ⟨b⟩ if ⟨b⟩ stands in front of ⟨l⟩ (also ⟨le⟩), and they are likewise marked with ∈. There are several cases of this sort, as can be seen in Table 9 (see p. 43ff.).

    The following are the first ten cases of the automatic graphemic separation and grapheme-phoneme mapping by the computer. The graphemes are separated with "|", and "||" means that there is a phoneme without graphemic representation; the word pronunciation is enclosed between "/"; and ":" pairs the grapheme with its corresponding phoneme:

    a|, /ah/, a:ah|
    a|b||l|er|, /ey b ah l er/, a:ey|b:b ah|l:l|er:er|
    a|b||le|, /ey b ah l/, a:ey|b:b ah|le:l|
    a|b|a|ck|, /ah b ae k/, a:ah|b:b|a:ae|ck:k|
    a|b|a|n|d|o|n|, /ah b ae n d ah n/, a:ah|b:b|a:ae|n:n|d:d|o:ah|n:n|
    a|b|a|n|d|o|n|ed|, /ah b ae n d ah n d/, a:ah|b:b|a:ae|n:n|d:d|o:ah|n:n|ed:d|
    a|b|a|n|d|o|n|i|ng|, /ah b ae n d ah n ih ng/, a:ah|b:b|a:ae|n:n|d:d|o:ah|n:n|i:ih|ng:ng|
    a|b|a|n|d|o|n|m|e|n|t|, /ah b ae n d ah n m ah n t/, a:ah|b:b|a:ae|n:n|d:d|o:ah|n:n|m:m|e:ah|n:n|t:t|
    a|b|a|t|e|d|, /ah b ey t ih d/, a:ah|b:b|a:ey|t:t|e:ih|d:d|
    a|b|d|a|ll|ah|, /ae b d ae l ah/, a:ae|b:b|d:d|a:ae|ll:l|ah:ah|

    All representations of phonemes by graphemes are shown in Table 9 (see p. 43ff.). Here "/.../" symbolizes a phoneme and "⟨...⟩" a grapheme, while "∈" means that the given phoneme is part of the grapheme cluster. The condition under which the given phoneme (usually a vowel) is placed (uttered) behind a grapheme is symbolized by a superscript; an entry for /ah/ marked with "*" means that on some occasions /ah/ can be pronounced within a grapheme, e.g. ABLE → /ey b ah l/. Though /ah/ is not overtly represented, the grapheme ⟨b⟩ is considered its representation. There is the possibility of simply ignoring the zero grapheme representation, but we decided for the above alternative.

    There are 289 different graphemic representations; many of them are used for different phonemes. The graph connecting the phonemes with graphemes is a bipartite graph which, because of its extent, cannot be presented here.
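    The paired notation shown above ("grapheme:phonemes|...") can be read back into a list of grapheme-phoneme pairs. This is only an illustrative sketch with a hypothetical function name, not the program used by the authors; it assumes that "|" separates the pairs and ":" separates a grapheme from its space-separated phonemes.

```python
def parse_alignment(alignment):
    """Turn 'a:ey|b:b ah|l:l|er:er|' into [(grapheme, [phonemes]), ...]."""
    pairs = []
    for chunk in alignment.strip().split("|"):
        if not chunk:                      # skip the empty piece after the final '|'
            continue
        grapheme, phonemes = chunk.split(":")
        pairs.append((grapheme.strip(), phonemes.split()))
    return pairs

abler = parse_alignment("a:ey|b:b ah|l:l|er:er|")
print(abler)
# Graphemes that frame more than one phoneme, here 'b' carrying /b/ and /ah/
print([grapheme for grapheme, phonemes in abler if len(phonemes) > 1])
```

    Counting the distinct graphemes per phoneme over all parsed words would give the number of representations K used in the following sections.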

    3 Uncertainty

    The first impression of Table 9 (p. 43) is that each phoneme has multiple representations. One would tend to say that there is a very weak connection between the phonemes and graphemes, and that phonemes are represented by combinations of Latin letters which contain mere phonetic orientations but nothing more. The situation can unreservedly be compared with Accadian writing, in which cuneiform symbols of different directions and sizes are combined, or, still better, with Chinese script, which always contains a phonetic guide. Hence, English writing resembles, and is developing into, a kind of linear hieroglyphic or logographic script. The extent of this development can be numerically expressed in different ways. Here we shall show only some very elementary methods.

    3.1 Unweighted uncertainty

    The variation or diversification of the way of representing a phoneme graphically can be called in general uncertainty. In the case that the individual representations are not weighted, uncertainty in information theory is defined as the dyadic logarithm of the number of representations, i.e.

    $$H_0 = \log_2 K \qquad (1)$$

    where K is the number of representations. Consider e.g. the phoneme /AA/ having 19 different representations. Its uncertainty can be characterized as H_0(/AA/) = log_2 19 = 4.25. There is no maximum of H_0 because K is potentially infinite, but its minimum is 0. The greater H_0, the more diversified the phoneme representation. The results for all phonemes are presented in the second and third columns of Table 2. It can be seen easily that vowels have more diversified representations than consonants, though some of them are weakly diversified (/AE/, /AW/, /OY/). Under other conditions of sampling and interpretation, one would get another picture, but any version is merely an approximation because of the diversity of English. There is no trend, e.g. for normality of the distribution of K or H_0, because the graphemic representations did not arise by chance but by historical development and adaptation of foreign words.

    Seeing the phoneme-grapheme relations as a bipartite graph, the number of graphemic representations is, as a matter of fact, the degree of a vertex (phoneme). Thus uncertainty, diversification and vertex degree are in this case synonymous.
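    Formula (1) is a one-line computation. The sketch below (function name chosen here for illustration) evaluates it for three values of K taken from Table 2.

```python
import math

def unweighted_uncertainty(k):
    """H0 = log2 K, where K is the number of graphemic representations."""
    return math.log2(k)

# K values for /AA/, /AH/ and /TH/ as listed in Table 2
for phoneme, k in {"AA": 19, "AH": 60, "TH": 4}.items():
    print(phoneme, round(unweighted_uncertainty(k), 4))
```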


    3.2 Weighted uncertainty

    Even if a phoneme has a great number of representations (K), not all of them are of the same importance. Their relevance is weighted by their frequency of occurrence. This can be of two sorts: one based on the dictionary and the other based on texts. If one of the representations occurs 1 000 times, it is surely more relevant than one occurring only once. Hence, another measure of uncertainty is the entropy of first order, which takes into account the relative frequencies of the individual representations. Usually one uses the Shannon entropy, defined as

    $$H_1 = -\sum_{i=1}^{K} p_i \log_2 p_i \qquad (2)$$

    where K is the number of representations and p_i is estimated by f_i/N, where f_i is the absolute frequency and N the number of occurrences of representations of the given phoneme (N = Σ f_i). The more concentrated the frequencies, the smaller the uncertainty H_1. In order to illustrate the computation, we use the representations of /TH/, where there are N = 741 cases distributed over the graphemic variants with frequencies 1, 736, 3, 1. Since formula (2) can be rewritten as

    $$H_1 = \log_2 N - \frac{1}{N}\sum_{i=1}^{K} f_i \log_2 f_i \qquad (2a)$$

    we obtain

    $$H_1 = \log_2 741 - \frac{1}{741}\left[1\log_2 1 + 736\log_2 736 + 3\log_2 3 + 1\log_2 1\right] = 0.0676.$$

    Though there are four graphemic variants, the uncertainty is very low, because one of them, ⟨th⟩, occurs in the great majority of cases. Consequently, even if K or H_0 are great, H_1 yields a more adequate picture of uncertainty/diversity. The results are presented in the fourth column of Table 2.
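    The worked example for /TH/ can be reproduced with formula (2a). A minimal sketch, with an illustrative function name:

```python
import math

def weighted_uncertainty(frequencies):
    """Shannon entropy H1 via (2a): log2 N - (1/N) * sum(f_i * log2 f_i)."""
    n = sum(frequencies)
    return math.log2(n) - sum(f * math.log2(f) for f in frequencies if f > 0) / n

th_frequencies = [1, 736, 3, 1]  # frequencies of the four representations of /TH/
print(round(weighted_uncertainty(th_frequencies), 4))  # 0.0676
```

    With equal frequencies the same function returns log_2 K, i.e. H_1 coincides with H_0 in that case.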

    H_0 and H_1 are characteristics of uncertainty/diversity. The first shows the raw diversity, the second the exploitation of this diversity. Theoretically they could be independent, but it can be shown that there is a tendential dependence of H_1 on H_0. The t- and F-tests show that the curves H_1 = a H_0^b and H_1 = a e^{bx} (with x = H_0) are adequate, though the determination coefficient is not high enough.


    Table 2: Uncertainties of individual phonemes

    Phoneme   K     H_0      H_1      R        N
    /AA/      19    4.2479   1.2742   0.4956   3989
    /AE/      7     2.8074   0.0560   0.9898   4884
    /AH/      60    5.9069   3.1340   0.1544   19983
    /AO/      19    4.2479   1.8927   0.4166   2411
    /AW/      7     2.8074   1.2608   0.5057   751
    /AY/      16    3.9999   1.0467   0.6783   2617
    /EH/      18    4.1699   0.9189   0.7185   5884
    /ER/      29    4.8580   2.0685   0.4144   6384
    /EY/      20    4.3219   1.3999   0.5644   3922
    /IH/      22    4.4594   1.0632   0.6196   13664
    /IY/      21    4.3923   2.4710   0.2265   7095
    /OW/      19    4.2479   1.0851   0.6735   2812
    /OY/      6     2.5850   1.1616   0.4963   306
    /UH/      13    3.7004   2.0327   0.3115   535
    /UW/      31    4.9542   2.7422   0.2278   2091
    /B/       5     2.3219   0.3049   0.9158   3790
    /P/       8     3.0000   0.4945   0.8418   5999
    /M/       9     3.1699   0.5761   0.8319   6379
    /F/       12    3.5850   1.0432   0.6553   3370
    /V/       5     2.3219   0.7787   0.6592   2711
    /W/       8     3.0000   1.1800   0.5485   1738
    /D/       5     2.3219   0.7994   0.7292   9082
    /T/       16    4.0000   0.9753   0.7255   13740
    /N/       12    3.5850   0.4592   0.8764   14163
    /TH/      4     2.0000   0.0676   0.9866   741
    /DH/      2     1.0000   0.3868   0.8601   185
    /S/       18    4.1699   1.5637   0.5308   12662
    /Z/       16    4.0000   1.2667   0.6300   6193
    /R/       9     3.1699   0.5337   0.8521   10798
    /L/       7     2.8074   0.9450   0.6609   11279
    /JH/      7     2.8074   1.5289   0.4649   3946
    /CH/      13    3.7004   1.5872   0.4130   1116
    /SH/      10    3.3219   1.8879   0.3572   2539
    /ZH/      8     3.0000   1.2482   0.6181   184
    /Y/       17    4.0874   1.5770   0.5105   1060
    /K/       21    4.3923   1.9685   0.4185   9163
    /G/       8     3.0000   0.9152   0.7339   2380
    /NG/      6     2.5850   0.5884   0.7692   3370
    /HH/      3     1.5850   0.0966   0.9777   1510


    3.3 Concentration

    Another way of characterizing the diversity is the Herfindahl measure of concentration, called repeat rate in linguistics. It is defined as the sum of squares of the probabilities of the graphemic representations,

    $$R = \sum_{i=1}^{K} p_i^2. \qquad (3)$$

    The probability is estimated by the relative frequency, i.e. p_i = f_i/N, which gives

    $$R = \frac{1}{N^2}\sum_{i=1}^{K} f_i^2. \qquad (3a)$$

    Here K and N are different for each phoneme. This index shows the concentration of graphemic representatives. If all frequencies are concentrated in one grapheme, then R = 1. If all frequencies are equal, i.e. the diversity is maximal, it attains the value R = 1/K. From the geometrical point of view, (3) is the squared Euclidean distance of the point (p_1, ..., p_K) from the origin in a K-dimensional space. For example, with /AE/ there are 7 graphemes but the frequencies are concentrated on ⟨a⟩, thus R = 0.9898. According to Table 2, the smallest concentration (the greatest dispersion) is found with /AH/, having R = 0.1544. It is possible to norm R in order to restrict it to the interval ⟨0, 1⟩, but we leave it in its original form. All results are presented in the fifth column of Table 2.
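    The repeat rate (3a) can be computed from the same frequency lists as the entropy. A short sketch with an illustrative function name, using the /TH/ frequencies given in Section 3.2:

```python
def repeat_rate(frequencies):
    """Repeat rate R via (3a): sum of squared frequencies divided by N squared."""
    n = sum(frequencies)
    return sum(f * f for f in frequencies) / (n * n)

print(round(repeat_rate([1, 736, 3, 1]), 4))  # /TH/: 0.9866, as in Table 2
print(repeat_rate([5, 5, 5, 5]))              # equal frequencies give 1/K = 0.25
```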

    4 Grapheme length distribution

    In a situation similar to English (i.e. where the number of phonemes is greater than the number of Latin characters), we expect graphemes of different length or modified graphemes as in Slavic languages. Since there are 26 Latin letters and the number of English phonemes is greater, there must be at least some graphemes consisting of two letters. However, borrowings from other languages automatically amplify the number of graphemes, and since English has other phonemes than the languages of origin, the exploitation of existing graphemes diversifies, i.e. some graphemes are polyphonic (used for different phonemes) and others synophonic (different graphemes used for the same phoneme). In Table 9 we see that the phoneme /AA/ can be represented by 19 synophonemic graphemes, and that a single polyphonemic grapheme can represent four phonemes. However, there is no absolute arbitrariness in assigning a grapheme to a phoneme, or in lengthening a grapheme by adding more letters to it. If there were no restrictions on length, it would develop randomly according to a Poisson process. Now, the Poisson distribution P_x = e^{-a} a^x / x! (x = 0, 1, ...) can be represented by the recurrence formula

    $$P_x = \frac{a}{x} P_{x-1} \qquad (4)$$

    and its shape is determined by the parameter a. For a > 1 it is bell-shaped. The greater a, the longer the tail of the distribution. Evidently, a must be greater than 1 because the extent of phonetic changes, borrowing and graphemic conservatism in English produces a great number of graphemes. The first step in coping with this proliferation would lead to the exploitation of two-letter graphemes, but not each grapheme can be used to represent each phoneme. Some two-letter graphemes are not allowed. Hence some three-letter graphemes must be applied, etc. However, phonetic reasons are not the only causes restricting the proliferation of grapheme length. It is above all the requirement of economy (or optimality) which cares for balance in all domains of language (cf. Zipf 1935, 1949; Köhler 1986). Thus a more rapid convergence must be built into formula (4). Tentatively we replace the proportionality function a/x by the Zipfian function a/x^b and obtain

    $$P_x = \frac{a}{x^b} P_{x-1}. \qquad (5)$$

    Solving (5) we obtain

    $$P_x = \frac{a^x}{(x!)^b} P_0, \quad x = 0, 1, 2, \ldots, \qquad (6)$$

    representing the Conway-Maxwell-Poisson distribution (cf. Wimmer & Altmann 1999), which has already been used in linguistics and is a special case of the Wimmer-Altmann (2005) approach. Since there is no "zero-length" grapheme, we either solve (5) for x = 2, 3, ... or we displace (6) one step to the right, in order to obtain

    $$P_x = \frac{a^{x-1}}{[(x-1)!]^b\, T}, \quad x = 1, 2, 3, \ldots, \qquad (7)$$

    where $T = \sum_{j=0}^{\infty} \frac{a^j}{(j!)^b}$, i.e. T is identical with (P_0)^{-1}.


    Table 3: Grapheme length distribution and fitting the Conway-Maxwell-Poisson distribution (7)

    x     f_x    NP_x
    1     26     25.18
    2     159    162.86
    3     94     88.98
    4     8      11.45
    5     2      0.54

    a = 6.4689, b = 3.5656
    χ² = 0.73, DF = 1, P = 0.39

    Applying (7) to our data we obtain the observed and computed values as given in Table 3.

    The normalizing constant is T = 11.4796. The result is presented graphically in Figure 1. Parameter a can be interpreted as the element of randomness (speaker creativity), conservatism of orthography, borrowing etc., while b represents the braking mechanism, the balancing force of economy.
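    The normalizing constant T and the expected frequencies of Table 3 can be recomputed from (7) with the fitted parameters. The following sketch is not the authors' software: the function name is hypothetical, the infinite sum for T is cut off after 100 terms (the remaining terms are negligible for these parameters), and the last class is assumed to pool the tail.

```python
import math

def cmp_displaced(a, b, x_max, terms=100):
    """1-displaced Conway-Maxwell-Poisson probabilities (7) for x = 1..x_max.

    Returns (T, probs). Assumption: the last class pools the tail.
    """
    # weights[j] = a**j / (j!)**b, computed in logs; lgamma(j + 1) = log(j!)
    weights = [math.exp(j * math.log(a) - b * math.lgamma(j + 1)) for j in range(terms)]
    t = sum(weights)
    probs = [weights[x - 1] / t for x in range(1, x_max)]
    probs.append(1.0 - sum(probs))
    return t, probs

T, probs = cmp_displaced(a=6.4689, b=3.5656, x_max=5)
N = 289  # total number of graphemes, the sum of f_x in Table 3
print(round(T, 4), [round(N * p, 2) for p in probs])
```

    The printed values can be compared with T = 11.4796 and with the NP_x column of Table 3.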

    [Figure 1: Grapheme length distribution: Conway-Maxwell-Poisson distribution (7). Horizontal axis: x; plotted series: observed f(x) and expected NP(x).]

    5 Polyphonemics and synophonemics of graphemes

    In Table 9 we see that a grapheme can represent several phonemes. Let us call this property graphemic polyphonemics. The great majority of graphemes are, of course, monophonemic: they can be ascribed to only one phoneme. We distinguish the direct ascription of a grapheme to a phoneme from the case in which a phoneme is part of a grapheme. E.g. /k/ is directly represented by ⟨x⟩ (as in "excel"), which differs from representing /k/ as part of ∈⟨x⟩ (as in "affix"). Thus ⟨x⟩ as a framing grapheme can be ascribed to 6 phonemes and as a direct grapheme to 2 phonemes, i.e. it can be representative in 8 cases. The results of counting can be found in Table 4 (x = number of phonemes represented by a grapheme; f_x = number of graphemes with x representations; NP_x = computed number of graphemes with x representations). The numbers in the table are to be read as follows: there are 191 graphemes, each of which represents exactly 1 phoneme; there are 43 graphemes, each of which represents exactly 2 phonemes, etc.

    In order to set up a model, we simply start from the Zipfian assumption of setting (relative) frequency proportional to the frequency class, using directly the function from (5), namely

    $$P_x = \frac{K}{x^b}, \quad x = 1, 2, 3, \ldots, \qquad (8)$$

    where K is the proportionality constant having the function of the normalizing constant, since we use (8) as a probability distribution. The parameter b is, again, a control parameter braking over-strong polyphonemy. Formula (8) is usually called Zipf's law or zeta distribution. Now, theoretically (8) has an infinite support, which is a nonrealistic situation. For our purposes it will be truncated after x = 10 because no grapheme represents more than 10 phonemes (up to now or in this variant of English). Hence we obtain

    $$P_x = \frac{K}{x^b}, \quad x = 1, 2, \ldots, R, \qquad (9)$$

    R being the truncation parameter (here 10). Applying (9) to the data in Table 4 we obtain the result in its third column. The graphic display is in Figure 2.

    The fit is excellent but it can be made still simpler. In the last row of Table 4 we see that the value of the exponent (the parameter b of (8), denoted a in the table) is approximately 2. Setting the exponent equal to 2 in (8) we obtain the so-called Lotka distribution

    $$P_x = \frac{6}{x^2 \pi^2}, \quad x = 1, 2, 3, \ldots, \qquad (10)$$

    also called the ergodic distribution of population size (cf. Wimmer & Altmann 1999: 394). Truncating it on the right side, however, we obtain another normalizing constant:

    $$P_x = \frac{1}{x^2\left[\pi^2/6 - \Psi(R+1)\right]}, \quad x = 1, 2, 3, \ldots, R, \qquad (11)$$

    where Ψ(.) is the trigamma function (the normalizing constant is simply the sum of 1/x² over the given domain of definition). Using the right truncated Lotka distribution (11) we obtain the results in the last column of Table 4. The result of fitting is slightly better because we have one more degree of freedom. But more important is the fixed parameter value.

    [Figure 2: Polyphonemics of English graphemes: right truncated zeta (9). Horizontal axis: x = 1, ..., 10; plotted series: observed f(x), NP(x) from (9) and NP(x) from (11).]
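    The right truncated forms (9) and (11) differ only in the exponent, so one small function covers both. The sketch below is illustrative (the function name is hypothetical); it normalizes by the finite sum of x^(-b) directly instead of using the trigamma function, which gives the same constant, and it is evaluated with the exponent fixed at 2 to reproduce the last column of Table 4.

```python
def truncated_zeta(b, r):
    """Right truncated zeta probabilities P_x = K / x**b for x = 1..r, as in (9)."""
    weights = [x ** -b for x in range(1, r + 1)]
    total = sum(weights)           # equals 1/K, the normalizing sum
    return [w / total for w in weights]

N = 289                            # number of graphemes, the sum of f_x in Table 4
lotka = truncated_zeta(b=2, r=10)  # exponent 2 gives the truncated Lotka form (11)
print([round(N * p, 2) for p in lotka])
```

    Passing a fitted exponent instead of 2 gives the corresponding values for (9).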

    Table 4: Graphemic polysemics in English

    x     f_x    NP_x (9)   NP_x (11)
    1     191    187.84     186.48
    2     43     46.37      46.62
    3     21     20.45      20.72
    4     9      11.44      11.65
    5     9      7.29       7.46
    6     6      5.05       5.18
    7     5      3.70       3.81
    8     1      2.82       2.91
    9     2      2.23       2.30
    10    2      1.80       1.86

    (9):  a = 6.4689, R = 10; X² = 3.09, DF = 7, P = 0.88
    (11): R = 10; X² = 3.12, DF = 8, P = 0.93


    Graphemic synophonemics simply considers the number of graphemes representing an individual phoneme. Using Table 9 we get the following numbers in decreasing order:

    60, 31, 29, 22, 21, 21, 20, 19, 19, 19, 18, 17, 17, 16, 16, 16, 13, 13, 12, 12, 10, 9, 9, 8, 8, 8, 8, 8, 7, 7, 7, 6, 6, 5, 5, 5, 4, 3, 2.

    As can easily be seen, the frequencies of individual representations are rather uniformly distributed; they do not display the same pattern as in graphemically "simpler" languages. The only possibility of searching for order is to consider their ranks as the independent variable. In that case, we obtain the data given in the first two columns of Table 5.

    Table 5: Rank-frequency distribution of English graphemic synophones

    Rank x   f_x   NP_x     Rank x   f_x   NP_x
    1        60    54.46    21       10    10.84
    2        31    36.11    22       9     10.41
    3        29    29.80    23       9     9.99
    4        22    26.21    24       8     9.58
    5        21    23.75    25       8     9.18
    6        21    21.89    26       8     8.77
    7        20    20.42    27       8     8.38
    8        19    19.19    28       8     7.98
    9        19    18.14    29       7     7.57
    10       19    17.22    30       7     7.17
    11       18    16.40    31       7     6.75
    12       17    15.66    32       6     6.33
    13       17    14.98    33       6     5.89
    14       16    14.35    34       5     5.42
    15       16    13.77    35       5     4.93
    16       16    13.22    36       5     4.39
    17       13    12.70    37       4     3.79
    18       13    12.21    38       3     3.07
    19       12    11.73    39       2     2.10
    20       12    11.28

    K = 2.1178, M = 0.6708, n = 38, DF = 35, X² = 5.43, P ≈ 1.00

    For ranking of linguistic units one usually uses the negative hypergeo-metric distribution or a distribution from the Lerch family containing alsothe Zipf, Zipf-Mandelbrot, zeta and other distributions (cf. Zörnig & Alt-

  • 8/17/2019 Gabriel Altmann, Fan Fengxiang (Editors)

    48/183

    38   Fan Fengxiang and Gabriel Altmann

    5 10 15 20 25 30 350

    10

    20

    30

    40

    50

    60

    70

    f(x)

    NP(x)

    Figure 3: Synophonemics of English graphemes, negative hypergeometric

    mann 1995; Köhler & Martináková-Rendeková 1998, Grzybek & Kelih 2003;Grzybek, Kelih, & Altmann 2004; Best 2005a,b,c). Here we adhere to thenegative hypergeometric because its fitting turned out to be the best as canbe seen in Table 5 and Figure 3, though zeta and Zipf-Mandelbrot both yieldvery satisfactory results. We use it in 1-displaced form

    \[
      P_x = \frac{\dbinom{M + x - 2}{x - 1} \dbinom{K - M + n - x}{n - x + 1}}{\dbinom{K + n - 1}{n}},
      \qquad x = 1, 2, \ldots, n + 1.   \tag{12}
    \]
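
    Formula (12) can be evaluated numerically with log-gamma functions, because K and M are real-valued. The following Python sketch is an illustration only, not the original fitting procedure: it recomputes the NP_x column of Table 5 from the reported K, M and n. The helper names log_binom and nhg_1displaced are hypothetical, and N is assumed to be the total of the observed frequencies.

```python
from math import exp, lgamma


def log_binom(a: float, k: int) -> float:
    """Log of the generalized binomial coefficient C(a, k) for real a > k - 1."""
    return lgamma(a + 1) - lgamma(k + 1) - lgamma(a - k + 1)


def nhg_1displaced(x: int, K: float, M: float, n: int) -> float:
    """P_x of formula (12) for x = 1, ..., n + 1; assumes K > M > 0."""
    return exp(
        log_binom(M + x - 2, x - 1)
        + log_binom(K - M + n - x, n - x + 1)
        - log_binom(K + n - 1, n)
    )


# Observed rank frequencies f_x from Table 5 (ranks 1..39).
observed = [60, 31, 29, 22, 21, 21, 20, 19, 19, 19, 18, 17, 17, 16, 16, 16,
            13, 13, 12, 12, 10, 9, 9, 8, 8, 8, 8, 8, 7, 7, 7, 6, 6, 5, 5, 5,
            4, 3, 2]
K, M, n = 2.1178, 0.6708, 38   # parameters reported under Table 5
N = sum(observed)              # assumption: N is the total of the f_x column

probabilities = [nhg_1displaced(x, K, M, n) for x in range(1, n + 2)]
expected = [N * p for p in probabilities]

# Print rank, observed and expected frequency for a few ranks.
for x in (1, 2, 39):
    print(x, observed[x - 1], round(expected[x - 1], 2))
```

    Since the parameters are given to four decimals only, the printed values should agree with the NP_x column of Table 5 up to small rounding differences.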

    6 Letter participation

    The 26 letters of the Latin alphabet used in English are not exploited equally to build graphemes. Letters which in Latin had a vocalic value are used more often than those having a consonantal value. Again, for individual letters we get exact numbers, but the only possibility to capture formally the set of nominal categories (letters) is to rank them according to their participation in graphemes. Since the set of letters is not too large, the best model is again the 1-displaced negative hypergeometric distribution (12). For orientation and further research we present the letters not in alphabetic but in ranked order. The result of the computation can be seen in Table 6 and Figure 4. The fit is excellent and corroborates once more the adequacy of this model.

    Table 6: Letter participation in English graphemes: Fitting the 1-displaced negative hypergeometric distribution (12)

    Letter  Rank x   f_x  NP_x(12)    Letter  Rank x   f_x  NP_x(12)

    e            1    94     91.64    p           14    17     18.80
    u            2    54     62.92    w           15    14     17.21
    o            3    52     51.97    d           16    13     15.68
    h            4    46     45.30    m           17    12     14.21
    t            5    40     40.47    y           18    11     12.78
    a            6    36     36.66    b           19    11     11.39
    r            7    35     33.48    f           20    10     10.02
    s            8    34     30.74    z           21     9      8.66
    i            9    32     28.31    k           22     8      7.32
    l           10    32     26.11    x           23     6      5.98
    c           11    25      24.1    q           24     5      4.61
    g           12    25     22.22    j           25     3      3.21
    n           13    19     20.46    v           26     3      1.73

    K = 2.5453, M = 0.7095, n = 25, DF = 22, X² = 6.95, P = 0.9990
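
    The goodness-of-fit statistic reported under Table 6 can be checked directly from the two frequency columns. The sketch below is an illustration, not the authors' software: it computes the chi-square sum X² = Σ (f_x − NP_x)² / NP_x over the 26 ranks. The degrees of freedom are computed under the assumption that three parameters (K, M, n) were estimated, which is consistent with the reported DF = 22.

```python
# Observed f_x and expected NP_x from Table 6, in rank order (ranks 1..26).
observed = [94, 54, 52, 46, 40, 36, 35, 34, 32, 32, 25, 25, 19,
            17, 14, 13, 12, 11, 11, 10, 9, 8, 6, 5, 3, 3]
expected = [91.64, 62.92, 51.97, 45.30, 40.47, 36.66, 33.48, 30.74, 28.31,
            26.11, 24.1, 22.22, 20.46, 18.80, 17.21, 15.68, 14.21, 12.78,
            11.39, 10.02, 8.66, 7.32, 5.98, 4.61, 3.21, 1.73]

# Pearson chi-square statistic over all rank classes.
chi_square = sum((f - e) ** 2 / e for f, e in zip(observed, expected))

# Assumption: classes - 1 - number of estimated parameters (K, M, n).
estimated_parameters = 3
degrees_of_freedom = len(observed) - 1 - estimated_parameters

print(round(chi_square, 2), degrees_of_freedom)
```

    The same check can be applied to the other tables, provided that any pooling of classes used in the original fit is reproduced.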

    Figure 4: Letter participation in different graphemes: negative hypergeometric (12) (observed f(x) and expected NP(x) plotted against rank x)

    7 Weighted participation

    A slightly different aspect is the evaluation of the weighted participation of letters. In Section 6 we examined the participation of letters in building graphemes, but we did not take into account the polyphonemy of graphemes, which is enormous in English. For example, the letter ⟨a⟩ occurs in 36 different graphemes, but each of these graphemes can be used to represent different phonemes, i.e. it can be polyphonemic. In this section we consider all occurrences of individual letters in graphemes, i.e. we compute the weighted participation of letters. The results are presented in Table 7. Again, the number of cases is too small (26) and, since almost all letters have different participation, no model could be set up. Hence we use, as above, the ranked weighted participation and, as expected, we obtain again the 1-displaced negative hypergeometric distribution. The result of fitting is graphically displayed in Figure 5.

    Table 7: Ranked weighted participation of letters in graphemes

    Letter  Rank x   f_x  NP_x(12)    Letter  Rank x   f_x  NP_x(12)

    e            1   169    169.89    w           14    21     27.20
    u            2   110    112.03    p           15    20     24.44
    o            3   100     90.44    n           16    20     21.85
    h            4    86     77.36    d           17    17     19.41
    a            5    85     67.95    z           18    14     17.09
    i            6    67     60.56    m           19    13     14.90
    t            7    48     54.45    x           20    13     12.82
    s            8    46     49.21    b           21    11     10.85
    r            9    44     44.60    f           22    11      8.97
    l           10    36     40.48    k           23     8      7.20
    c           11    35     36.75    j           24     6      5.52
    g           12    33     33.32    q           25     5      3.94
    y           13    23     30.15    v           26     4      3.62

    K = 2.8339, M = 0.6885, n = 26, DF = 22, X² = 14.64, P = 0.88

    Figure 5: Ranked weighted participation of letters in graphemes (NHG) (observed f(x) and expected NP(x) plotted against rank x)

    8 Letter utility

    The last aspect we shall analyze here is the so-called letter utility. In the previous sections we considered the presence of a letter in different graphemes and its presence in all graphemes; here we consider its position in the grapheme. We take into account only different graphemes and ignore their polyphonemics. The position of a letter in a grapheme is a measure of its relevance. The earlier the letter appears in the cluster, the more it contributes to its phonetic value. At least this can be assumed to hold in general (although it does not hold in each case). This is especially well expressed in French, where the morphology represented by inflections is dying out and the first letter of a long grapheme is decisive for the phonetic form, e.g.   → /parl/. Let us illustrate the problem using the graphemes containing ⟨q⟩, namely

    .

    Let n_q be the set of graphemes containing ⟨q⟩ and |n_q| the cardinal number of this set. Let w_x be the weight of the letter given by its position in the grapheme x. We define first

    \[ PP_q = \sum_{x \in n_q} w_x   \tag{13} \]

    as the sum of all weights (positions) of ⟨q⟩ in the graphemes of the set n_q. For the letter ⟨q⟩ we obtain from the above example |n_q| = 5 and

    \[ PP_q = 2 + 2 + 1 + 1 + 1 = 7. \]

    For comparative purposes we define the mean

    \[ \overline{PP}_q = \frac{1}{|n_q|} \sum_{x \in n_q} w_x.   \tag{14} \]

    In our example the mean is 7/5 = 1.4. The results for all letters are given in Table 8, #G denoting the number of graphemes and MLU the mean letter utility.
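
    Definitions (13) and (14) amount to summing and averaging the 1-based positions of a letter over the graphemes that contain it. The following Python sketch illustrates this for the letter q. The grapheme list is a hypothetical stand-in, chosen only so that the positions come out as 2, 2, 1, 1, 1 as in the example above. The function name position_weights is also hypothetical, and taking the first occurrence of a letter that appears more than once in a grapheme is an assumption.

```python
def position_weights(letter: str, graphemes: list[str]) -> list[int]:
    """1-based position of `letter` in every grapheme that contains it.

    Assumption: if the letter occurs more than once, the first occurrence counts.
    """
    return [g.index(letter) + 1 for g in graphemes if letter in g]


# Hypothetical graphemes containing q (illustration only, not the original list).
graphemes_with_q = ["cq", "cqu", "q", "qu", "que"]

weights = position_weights("q", graphemes_with_q)
pp = sum(weights)                 # formula (13): sum of positions
mean_pp = pp / len(weights)       # formula (14): mean letter utility

print(weights, pp, mean_pp)
```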

    Ordering the letters according to their utility in graphemes we obtain the order: . It would be possible to order the letters according to their absolute weight, too.

    Table 8: Letter utility in English graphemes

    Letter  Weight    #G     MLU    Letter  Weight    #G     MLU

    a           59    38  1.5526    n           31    21  1.4762
    b           17    13  1.3077    o           81    54  1.5000
    c           40    27  1.4818    p           26    19  1.3684
    d           19    15  1.2667    q            7     5  1.4000
    e          201    97  2.0722    r          103    45  2.2889
    f           17    13  1.3077    s           75    39  1.9231
    g           44    27  1.6296    t           69    41  1.6829
    h          100    45  2.2222    u           91    47  1.9362
    i           57    36  1.5833    v            4     3  1.3333
    j            5     4  1.2500    w           24    14  1.7143
    k           10     8  1.2500    x           10     5  2.0000
    l           87    37  2.3514    y           19    11  1.7272
    m           21    14  1.5000    z           14     9  1.5555

    The mean utility of all letters can be expressed as the ratio of the sum of the second column of Table 8 to the sum of the third column, i.e. 1231/687 = 1.7918, or as the ratio of the sum of the second column to the number of letters, i.e. 1231/26 = 47.35. Comparing this last number with Italian, where the ratio is 6.48 (cf. Bernhard & Altmann, this volume, pp. 13ff.), one sees that the diversification of graphemics in English is enormous.
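
    The two summary ratios, and an ordering of the letters by utility, follow directly from the Weight and #G columns of Table 8. The Python sketch below is an illustration only. The variable names are hypothetical, and sorting in descending order of MLU is an assumption, since the direction of the original ordering is not stated here.

```python
# (Weight, #G) per letter, copied from Table 8.
table8 = {
    "a": (59, 38), "b": (17, 13), "c": (40, 27), "d": (19, 15),
    "e": (201, 97), "f": (17, 13), "g": (44, 27), "h": (100, 45),
    "i": (57, 36), "j": (5, 4), "k": (10, 8), "l": (87, 37),
    "m": (21, 14), "n": (31, 21), "o": (81, 54), "p": (26, 19),
    "q": (7, 5), "r": (103, 45), "s": (75, 39), "t": (69, 41),
    "u": (91, 47), "v": (4, 3), "w": (24, 14), "x": (10, 5),
    "y": (19, 11), "z": (14, 9),
}

# Mean letter utility per letter: weight divided by number of graphemes.
mlu = {letter: weight / count for letter, (weight, count) in table8.items()}

# Assumption: letters ordered from highest to lowest utility.
ranking = sorted(mlu, key=mlu.get, reverse=True)

total_weight = sum(weight for weight, _ in table8.values())
total_graphemes = sum(count for _, count in table8.values())

mean_utility = total_weight / total_graphemes      # ratio of column sums
weight_per_letter = total_weight / len(table8)     # column sum per letter

print(ranking[:5], round(mean_utility, 4), round(weight_per_letter, 2))
```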

    9 Conclusions

    English graphemics is a very complex matter. Some of the measures (indices) introduced here make it evident. They differ drastically from those in other languages. The loss of a unique phonetic value of a letter reduces letters to merely graphical signs obtaining a phonetic value only in a grapheme. The way to hieroglyphism is open.

    All indices introduced here can be analyzed further statistically. They have their sampling distributions, asymptotic tests can be set up, languages can be classified according to their graphemics, and there is a possibility to find interrelations among all these properties and also between graphemic and non-graphemic properties. The ultimate aim is, of course, to find laws of graphemics and join them in a system of laws, i.e. in a theory. At present, any such enterprise would be premature.

    Table 9: English phoneme-grapheme correspondences

    Phoneme   Freq.  Examples (grapheme in parentheses)

    /AA/       1412  (a)bo
                  3  baz(aa)r
                 14  y(ah)
                 18  (al)mond
                  1  arkans(as)
                  1  baccar(at)
                 31  astron(au)t
                  6  (aw)ful
                 16  s(er)geant
                 20  wholeh(ea)rtedly
                  2  bur(eau)cracy
                  1  exh(au)stively
                 15  (ho)nors
                  1  l(i)ngerie
               2427  abd(o)minal
                  7  j(oh)n
                  2  s(ol)der
                  4  c(ou)gh
                  8  ackn(ow)ledgement
    /AE/       4859  ab(a)ck, zigz(a)gging
                  2  g(ah)n, p(ah)
                  1  pl(ai)d
                 11  beh(al)f, s(al)mon
                  9  (au)nt, l(au)ghter
                  1  y(eah)
                  1  chop(i)n
    /AH/       5028  (a)bide, mad(a)m
                  1  is(aa)c
                  7  an(ae)sthesia, minuti(ae)
                 20  abdall(ah), tor(ah)
                 36  barg(ai)n, vill(ai)ns
                 10  (au)gusta, nickl(au)s

               3619  abandonm(e)nt, zab(e)l
                 24  chang(ea)bl, veng(ea)nce
                  2  bur(eau)crat, bur(eau)crats
                  7  for(ei)gn, surf(ei)t
                 13  bludg(eo)n, surg(eo)ns
                  7  advantag(eou)s, right(eou)sness
                  2  paraph(er)nalia, res(er)voir
                  9  budd(ha), wind(ha)m
                  3  ve(he)mence, ve(he)mently
                  8  anni(hi)lation, pro(hi)bition
               2379  abdom(i)nal, zoolog(i)st
                185  acac(ia), venet(ia)n
                 36  anc(ie)nt, trans(ie)nt
               1349  abduct(io)n, volit(io)n
                 66  ambit(iou)s, vivac(iou)s
                  1  belg(iu)m
               2680  aband(o)n, zool(o)gy
                  1  mendelss(oh)n
                  5  conn(oi)sseur, tort(oi)se
                  2  linc(ol)n, norf(ol)k
                  1  m(on)sieur
                 17  bl(oo)d, fl(oo)d
                317  adulter(ou)s, zeal(ou)s
                  1  mccull(ough)
               2864  abd(u)ction, y(u)m
                  1  etiq(ue)tte
                  3  br(uh)n, (uh)
                  5  bisc(ui)t, circ(uit)s
                 53  anal(y)ses, vin(y)l
            ∈   414  able, ambling
            ∈    32  babbled, bubbling
            ∈     5  subtler, subtly
            ∈    21  article
            ∈    20  buckled
            ∈    43  beadles, idling, haydn
            ∈    31  addle, huddling

            ∈     7  rifle
            ∈    13  baffle
            ∈    24  bedraggled
            ∈    12  ankle
            ∈    10  mcalister
            ∈     9  one
            ∈    42  ample
            ∈    18  apple
            ∈   146  activism
            ∈     3  muscle
            ∈     1  tussle
            ∈    25  apostle
            ∈    19  beetles
            ∈    44  battle
            ∈     5  logarithm, algorithm
            ∈   251  accum(u)lated
            ∈     2  axle
            ∈    24  dazzled
    /AO/        344  (a)lbany, y(a)lta
                  1  ut(ah)
                 24  b(al)ked, w(al)kways
                  1  extr(ao)rdinary
                257  appl(au)d, v(au)lts
                  1  v(augha)n
                132  (aw), y(aw)ning
                  3  (awe)some, dr(awe)rs
                  1  s(ea)n
                  5  g(eo)rgia, g(eo)rgetown
                  7  ex(hau)st, inex(hau)stible
                  2  ex(ho)rtations, ex(ho)rting
                  2  dig(io)rgio, g(io)rgio
               1487  abh(o)rrent, zl(o)tys
                 54  ab(oa)rd, washb(oa)rd
                 16  d(oo)r, outd(oo)rs
                  1  f(ore)runner
                 72  betanc(ou)rt, y(ou)r

                  1  unt(owa)rd
    /AW/          6  l(ao), t(ao)ists
                 32  aden(au)er, t(au)ssig
                  3  (hou)r, (hou)rs
                486  ab(ou)nd, whereab(ou)ts
                  3  b(ough), h(ough)s
                219  all(ow), y(ow)
                  2  d(owe)r, h(owe)
    /AY/          1  m(ae)stro
                 21  al(ai), th(ai)land
                  6  b(ay)ou, sant(ay)ana
                  2  (aye), (aye)s
                 80  alam(ei)n, z(ei)tler
                  5  ch(ey)enne, m(ey)ers
                 10  bug(eye)d, (eye)witness
               2293  ab(i)des, wr(i)ting
                  2  d(ia)mond, d(ia)monds
                 14  l(ie), unt(ie)
                 22  h(igh), th(igh)
                  2  c(oy)ote, c(oy)otes
                  6  beg(ui)led, disg(ui)se
                  6  b(uy), sch(uy)ler
                342  acol(y)te, wr(y)ly
                  5  b(ye), r(ye)
    /EH/        494  actu(a)rial, y(a)rrow
                 12  (ae)rial, kr(ae)mer
                120  ad(ai)r, volt(ai)re
                  4  pr(ay)er, s(ay)s
               4955  ab(e)d, z(e)st
                257  abr(ea)st, z(ea)lous
                  1  k(ee)lson
                  2  g(eh)rig, k(eh)le
                  4  l(ei)sure, th(ei)rs
                  5  j(eo)pardizing, l(eo)pards
                  2  int(er)rogation, int(er)rogator
                  1  pirou(ette)

                  1  r(ey)nolds
                  3  (hei)r, (hei)rs
                  9  ch(ie)n, unfr(ie)ndly
                  1  (oe)dipal
                 11  b(u)ry, woodb(u)ry
                  2  marq(ue)tte, velasq(ue)z
    /ER/          1  an(aer)obic
                  1  lecl(air)
                389  afterw(ar)d, wiz(ar)d
                 27  (arr)anges, re(arr)anged
                  4  (aur)ora, rest(aur)ateur
                 63  ath(ear)n, y(ear)nings
               3972  abl(er), zurch(er)
                  1  w(ere)
                 30  ab(err)ation, unint(err)upted
                 14  amat(eur), restaurat(eur)
                  4  (her)b, shep(her)ds
                  2  plag(iar)ism, tert(iar)y
                  8  croz(ier), sold(ier)s
                211  ad(ir)ondack, wh(ir)lwind
                  3  chesh(ire), staffordsh(ire)
                  4  st(irr)ed, wh(irr)ing
                  3  cupb(oar)d, starb(oar)d
                  2  c(olo)nel, c(olo)nels
                803  ab(or)iginal, wh(or)ls
                 30  c(orr)al, w(orr)ying
                 50  adj(our)ned, y(our)self
                 58  ac(re), wi(re)s
                  5  i(ro)n, i(ro)nside
                485  abs(ur)d, z(ur)cher
                144  advent(ure), vult(ure)
                 59  bl(urr)ed, unh(urr)ied
                  6  b(yr)d, mart(yr)s
                  1  m(yrrh)
            ∈     4  figure
    /EY/       2886  (a)bler, z(a)bel

                  8  br(ae), vertebr(ae)
                496  abig(ai)l, whitet(ai)l
                  2  g(au)ge, g(au)ged
                311  alw(ay)s, yesterd(ay)s
                  1  m(aye)
                 66  alfr(e)do, y(e)hudi
                 33  beefst(ea)k, y(ea)ts
                  3  b(ee)thoven, soir(ee)
                  2  l(eh)mann, n(eh)ru
                 48  alex(ei), w(ei)ghty
                 11  n(eigh)bor, w(eigh)s
                  1  bouvi(er)
                  1  d(es)cartes
                 11  ball(et), val(et)
                  1  ricoch(ete)d
                 37  ab(ey)ance, th(ey)
                  1  linger(ie)
                  2  communiq(ue)s, enriq(ue)
            ∈     1  b(ue)no
    /IH/         97  acre(a)ge, yard(a)ge
                  1  (ae)gean
                  1  barg(ai)ning
               2669  abat(e)d, z(e)ros
                101  alv(ea)r, y(ea)rs
                 55  auction(ee)r, volunt(ee)rs
                  7  counterf(ei)t, w(ei)rdly
                  1  cretac(eou)s
                  1  c(ey)lon
                  1  rend(ez)vous
                  2  hemorr(ha)ging, hemorr(ha)ge
                  6  budd(hi)sm, ex(hi)bits
              10415  abandon(i)ng, zur(i)ch
                  7  carr(ia)ges, marr(ia)ge
                 25  b(ie)rce, s(ie)ve
                  4  feroc(iou)sly, malic(iou)sly
                  1  w(o)men

                 12  bacch(u)s, ponti(u)s
                  1  racq(ue)t
                 18  b(ui)ld, shipb(ui)lding
                  1  green(wi)ch
                238  ab(y)smal, (y)vette
    /IY/          7  alg(ae), p(ae)an
                 13  ass(ay), wednesd(ay)
               1222  abil(e)ne, zo(e)
                611  agl(ea)m, z(ea)lously
                525  absent(ee), yank(ee)s
                  1  l(eh)man
                 53  b(ei)n, w(ei)r
                  1  l(eigh)
                  3  p(eo)pled, p(eo)ple
                193  abb(ey), yanc(ey)
                  1  diarr(he)a
                  1  del(hi)
               1622  abilit(i)es, compan(i)es
                281  p(ie)ce, zomb(ie)
                  1  cast(ill)o
                  3  chabl(is), debr(is)
                  1  pet(it)
                  2  ph(oe)nix, subp(oe)na
                  2  mosq(ui)to, mosq(ui)toes
                  1  marq(uis)
               2551  abernath(y), zoolog(y)
    /OW/          1  dav(ao)
                  7  ch(au)ffeur, s(au)ternes
                  4  b(eau)jolai, tabl(eau)
                  1  b(eaux)
                  1  peug(eot)
                  1  s(eou)l
                  4  s(ew), s(ew)n
               2287  abd(o)men, z(o)ology
                177  afl(oa)t, wh(oa)
                 30  b(oe)ing, w(oe)fully

                  2  b(oh)len, c(oh)en
                  7  c(ol)mer, y(ol)k
                  2  r(oo)sevelt, r(oo)sevelts
                  1  aprop(os)
                  3  dep(ot), p(ot)pourri
                 16  b(ou)lder, sh(ou)lders
                 15  alth(ough), th(ough)
                249  arr(ow), yell(ow)ish
                  4  marl(owe), st(owe)
    /OY/          4  bayr(eu)th, r(eu)ther
                  1  hemorr(hoi)ds
                181  adr(oi)t, v(oi)ds
                  1  iroqu(ois)
                117  all(oy), v(oy)age
                  2  b(uoy)ancy, b(uoy)ant
    /UH/          6  n(eu)ral, n(eu)rotic
                  1  post(hu)mous
                  9  b(o)som, w(o)manhood
                224  adulth(oo)d, yearb(oo)k
                  1  w(or)cester
                 24  bonj(ou)r, y(ou)rselves
                  3  c(oul)d, w(oul)d
                178  acap(u)lco, z(u)rich
                  1  nieb(uh)r
                  3  fl(uo)rescent, fl(uo)rine
            ∈     3  (eu)rasian
            ∈     1  milieu
            ∈    81  brav(u)ra
    /UW/         30  br(eu)er, z(eu)s
                130  andr(ew), withdr(ew)
                  2  sil(hou)etted, sil(hou)ette
                  4  ad(ieu), l(ieu)tenants
                 79  ad(o), wrongd(o)ing
                  7  can(oe), sh(oe)string
                356  aftern(oo)n, z(oo)ms
                  2  p(ooh), p(ooh)ed

                 73  ac(ou)stic, y(ou)ths
                  1  den(oue)ment
                  5  breakthr(ough), thr(ough)put
                  2  c(oup), c(oup)s
                  1  rendezv(ous)
                  1  led(oux)
                818  absol(u)tes, y(u)goslavia
                 58  accr(ue)d, virt(ue)s
                  1  k(uh)n
                 46  br(ui)ses, uns(ui)ted
                  2  b(uo)yed, b(uo)ys
                  3  t(wo), t(wo)some
            ∈     4  beautiful
            ∈    18  eucalyptus
            ∈     1  ewe
            ∈     7  interview
            ∈     1  houston
            ∈   405  acc(u)mulated
            ∈    26  arg(ue)
            ∈     2  h(ugh)
            ∈     1  (hu)hes
            ∈     3  deb(ut)
            ∈     2  vac(uu)m
    /B/        3625  a(b)ack, zom(b)ie
                109  a(bb)as, we(bb)er
                 52  ascri(be), wardro(be)
                  1  la(bo)ratories
                  3  cam(pb)ell, cu(pb)oards
    /P/           1  ha(b)sburg
                  1  su(bp)oena
               5489  abru(p)t, zi(p)
                  1  princi(pa)lly
                114  antelo(pe), wi(pe)
                  2  u(ph)olstered, u(ph)olstery
                390  agri(pp)a, zi(pp)er
                  1  bankru(pt)cy

    /M/           2  diaphra(gm), paradi(gm)
               5807  abandon(m)ent, zoo(m)s
                 46  aplo(mb), whitco(mb)
                220  afla(me), wholeso(me)
                283  acco(mm)odated, zi(mm)erman
                  1  fe(mme)
                 16  autu(mn), sole(mn)ly
                  1  te(mp)tation
                  3  gover(nm)ent, gover(nm)entally
    /F/        2694  adol(f), yoursel(f)
                 26  cha(fe), wildli(fe)
                294  a(ff)able, ze(ff)irelli
                  1  jolli(ffe)
                  1  uncom(for)tably
                  8  o(ft)en, so(ft)ens
                  2  aw(fu)lly, power(fu)lly
                 27  cou(gh), trou(gh)s
                  1  (pf)ennig
                312  al(ph)a, xeno(ph)obia
                  2  so(pho)more, so(pho)mores
                  2  gusta(v), moloto(v)
    /V/           2  o(f), thereo(f)
                  3  ste(ph)en, ste(ph)enson
               2123  abbre(v)iated, y(v)ette
                581  aborti(ve), wo(ve)
                  2  re(vv)ed, sa(vv)y
    /W/           4  (ju)an, ti(ju)ana
                  1  biv(ou)ac
                381  acq(u)aint, venez(u)elan
               1224  after(w)ard, wrist(w)atch
                116  any(wh)ere, (wh)y
            ∈     9  (o)ne
                  2  ch(o)ir
            ∈     1  b(ue)no
    /D/        7688  aban(d)on, zealan(d)
                131  a(dd), yi(dd)ish

                293  abi(de), worldwi(de)
                967  abandon(ed), kill(ed)
                  3  (t)aoism, (t)aoists
    /T/          10  de(bt), undou(bt)edly
                  5  ya(cht), ya(cht)sman
                  4  conne(ct)icut, indi(ct)ments
                154  acquiesce(d), seduce(d)
                  7  bernhar(dt), schmi(dt)
                436  abolish(ed), zipp(ed)
                219  aforethou(ght), wrou(ght)
                  4  (pt)olemaic, recei(pt)s
              11661  abandonmen(t), zoologis(t)
                693  absolu(te), wro(te)
                  1  descar(tes)
                 12  apar(th)eid, (th)omson
                500  abe(tt)ed, wri(tt)en
                 28  antoine(tte), yve(tte)
            ∈     5  na(z)i
            ∈     1  pi(zz)a
    /N/           2  we(dne)sday, we(dne)sdays
                 65  ali(gn), vi(gn)ette
                  2  colo(gne), champa(gne)
                 58  ac(kn)owledges, un(kn)own
                  1  co(mp)troller
              13247  aba(n)do(n), zo(n)ing
                  5  gra(nd)children, wi(nd)sor
                453  abile(ne), zo(ne)
                316  ante(nn)a, wy(nn)
                 10  a(nne), wy(nne)
                  3  denoueme(nt), rapprocheme(nt)
                  1  (pn)eumonia
    /TH/          1  sou(t)hampton
                736  aberna(th)y, zeni(th)
                  3  bli(the)ly, wri(the)
                  1  ca(tho)lic
    /DH/        171  altoge(th)er, you(th)s

                 14  ba(the), soo(the)
    /S/        1209  absen(c)es, yan(c)y
                504  abeyan(ce), when(ce)
                  3  glou(ces)ter, wor(ces)ter
                  7  fawk(es), wilk(es)
                 30  (ps)alm, (ps)yllium
               9066  abel(s)on, zoologi(s)t
                 93  acquie(sc)ence, vi(sc)eral
                  5  acquie(sce), coale(sce)
                230  abu(se), wor(se)
                939  abruptne(ss), zei(ss)
                  4  impa(sse), ru(sse)
                 27  che(st)nut, wre(st)ling
                  7  an(sw)er, unan(sw)ered
                 29  auschwit(z), walt(z)
            ∈   501  affi(x)
            ∈     2  a(xe)
            ∈     5  na(z)i
            ∈     1  pi(zz)a
    /Z/           1  (cz)ar
                409  abiliti(es), zombi(es)
               4877  abel(s), zoom(s)
                203  abu(se)d, who(se)
                  4  bu(si)ness, bu(si)nessmen
                  1  unrea(so)ning
                  1  ra(sp)berry
                 17  de(ss)ert, sci(ss)ors
                  1  a(sth)ma
                  2  clo(thes)horse, plainclo(thes)
                  1  (ts)ar
                  6  an(x)ieties, (x)enophobia
                347  maga(z)ine, (z)urcher
                201  abla(ze), visuali(ze)
                 30  bli(zz)ard, whi(zz)ing
            ∈    92  au(x)iliary
    /R/          44  av(er)ages, vet(er)inary

                  4  deb(or)a, satisfact(or)y
               9955  abno(r)mal, zu(r)ich
                379  adhe(re), yo(re)
                 20  go(rh)am, (rh)ythmically
                318  abe(rr)ant, ya(rr)ow
                  1  ca(rre)
                  1  rappo(rt)
                 76  a(wr)y, (wr)yly
    /L/           1  imbro(gl)io
               9032  ab(l)er, zoo(l)ogy
                890  ab(le), ya(le)
               1306  abda(ll)ah, zeffire(ll)i
                 43  be(lle), wa(lle)
                  4  i(sl)and, i(sl)es
                  3  ai(sle), carli(sle)
    /JH/         47  a(d)ulation, une(d)ucated
                 50  e(dg)y, sto(dg)y
                 63  acknowle(dge)d, we(dge)d
                 27  a(dj)acen

