Index Phonemica (IPHON) is a database of phoneme inventories and allophonic rules extracted from source documents, focusing primarily on linguistic areas underrepresented in existing databases such as PHOIBLE, WALS, and PBase. Because of this focus, the Index does not recycle data from existing databases.

The current version, v0.5.1, contains 459 entries, representing data from 301 languages. These entries contain a total of 1012 distinct segments, each of which is mapped to a set of features. (This mapping currently uses a modified version of PHOIBLE's featuralization code and feature set, but this will change before v1.0.)

For the name, cf. Index Diachronica.

Using the Index

The Index can be browsed by language, doculect, or segment view, or searched with Pshrimp.

The Index makes no attempt at representative sampling of languages, and thus should not be used to establish overall statistical patterns.

Languages and doculects

The distinction between 'language' and 'doculect' is largely borrowed from PHOIBLE; however, the Index uses Glottocodes rather than ISO codes to uniquely identify languages. In the Index, a language is simply a language-level Glottocode, although dialect-level Glottocodes are assigned in the dialect_name field when available. Language metadata, such as language family, latitude, and longitude, is imported from Glottolog.

A doculect in the Index corresponds to a single source. As a result, there may be many different doculect entries for one language. Doculects are uniquely identified by IPHON IDs, which consist of the Glottocode of the associated language and a chronologically incremented index, separated by a hyphen. (Chronological incrementing allows for unambiguous reference to specific entries in source documents that contain multiple inventories for the same language.)
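As a rough illustration of the ID scheme described above, the following sketch splits an IPHON ID into its Glottocode and index and groups doculects by language. It assumes that the index is a plain non-negative integer and relies on the usual Glottocode shape (four alphanumeric characters followed by four digits). The function names and the example IDs are hypothetical and are not taken from the Index itself.

```python
import re
from typing import NamedTuple

# Assumed shape: <glottocode>-<index>, where a Glottocode is four
# alphanumeric characters followed by four digits, and the index is
# assumed (not confirmed by the Index) to be a plain integer.
IPHON_ID_PATTERN = re.compile(r"([a-z0-9]{4}[0-9]{4})-([0-9]+)")


class IphonId(NamedTuple):
    glottocode: str
    index: int


def parse_iphon_id(text: str) -> IphonId:
    """Split an IPHON ID into its Glottocode and numeric index."""
    match = IPHON_ID_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a well-formed IPHON ID: {text!r}")
    return IphonId(match.group(1), int(match.group(2)))


def group_by_language(ids):
    """Map each Glottocode to the sorted list of its doculect indices."""
    grouped = {}
    for raw in ids:
        parsed = parse_iphon_id(raw)
        grouped.setdefault(parsed.glottocode, []).append(parsed.index)
    return {code: sorted(indices) for code, indices in grouped.items()}


# Hypothetical IDs, made up for illustration only.
example = group_by_language(["abcd1234-2", "abcd1234-1", "wxyz5678-1"])
```

Under these assumptions, one language (one Glottocode) can map to several doculect indices, mirroring the one-to-many relation between languages and sources.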

Segmental transcription

The Index uses a slightly modified version of IPA. The differences are as follows:

We use the Sinological ɿ for the 'apical vowel', as well as for the 'fricated /i/' of other languages. (See Connell 1997 and Faytak 2014.) ʮ and ʅ are also used with their Sinological meanings. We also introduce the symbol ꭒ, for a 'fricated /u/' or IPA [v̩].
We use the full set of alveolo-palatal consonant characters, including ȶ ȡ ȵ ȴ.
For typographic reasons, we use ʵ instead of the IPA rhotic hook.
Prenasalized consonants are written with preceding ⁿ. Postnasalized consonants are written with following ⁿ. Prenasalized trills are currently assumed to have a plosive element, although this will probably be changed before v1.0.
In cases where Cr or Cl sequences are given as units, we write the liquid element with a superscript. (So there's a difference between gʟ and gˡ: the former is completely velar, whereas the latter is presumably realized as a velar plosive followed by a coronal lateral.)
Breathy-voiced consonants or 'voiced aspirates' are written with following ʱ.
Super-high 66 tones are written ˥́, i.e. high tone with combining acute.
The retraction diacritic on vowels or tones is used to indicate laryngealization or tense voice, following the Tibeto-Burman convention.
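Some of the conventions above can be checked mechanically. The sketch below tags a segment string according to three of them: preceding or following ⁿ, following ʱ, and a superscript liquid. It is only an illustration: the function name and tag labels are hypothetical, and the use of ʳ (U+02B3) as the superscript for Cr units is an assumption, since only ˡ is shown in the text.

```python
NASAL_MARK = "\u207f"      # ⁿ, superscript n
BREATHY_MARK = "\u02b1"    # ʱ, superscript hooked h
# ˡ (U+02E1) appears in the text; ʳ (U+02B3) is an assumed counterpart.
SUPERSCRIPT_LIQUIDS = {"\u02e1", "\u02b3"}


def describe_segment(segment: str) -> set:
    """Return descriptive tags implied by the transcription conventions."""
    tags = set()
    if len(segment) > 1 and segment.startswith(NASAL_MARK):
        tags.add("prenasalized")
    if len(segment) > 1 and segment.endswith(NASAL_MARK):
        tags.add("postnasalized")
    # ʱ counts only when it follows some base character.
    if BREATHY_MARK in segment[1:]:
        tags.add("breathy-voiced")
    # A superscript liquid after the base marks a Cr or Cl unit.
    if any(ch in SUPERSCRIPT_LIQUIDS for ch in segment[1:]):
        tags.add("liquid-unit")
    return tags
```

For instance, under these assumptions gˡ is tagged as a liquid unit while gʟ is not, because ʟ is a full-size velar lateral rather than a superscript.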

Using the data

The data repository is here. To report an error in the data, file an issue.