Speech Corpus Tools: Tutorial and examples¶

Enriching databases¶

SCT supports an array of enrichments on imported corpora. Typically a corpus starts off with just words and phones, but higher-level information about utterances and intermediate information about syllables is useful for corpus research. This section describes the pipeline that you should follow to enrich your corpus.

Non-speech elements¶

The first enrichment to run is encoding which annotations are not speech. These can be things like silence, coughs, and laughter. To encode non-speech elements:

Go to the “Enhance corpus” menu
Select the “Encode non-speech elements...” option
Replace the default regular expression if needed

The default is the regular expression for the Buckeye corpus

It matches all annotations for silence, the interviewer, laughter and other such elements

For FAVE, set it to ^sp$

For TIMIT and other force-aligned TextGrids, set it to ^<SIL>$

Press Encode and wait for it to finish
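Before encoding, it can help to check what a pattern will match on a handful of labels. The following is a minimal sketch using Python's `re` module, not SCT's own code. The label list and the helper name `non_speech_labels` are made up for illustration, and it assumes the pattern is applied to each annotation label as a whole.

```python
import re


def non_speech_labels(labels, pattern):
    """Return the labels matched by the regular expression."""
    regex = re.compile(pattern)
    return [label for label in labels if regex.search(label)]


# Hypothetical annotation labels, for illustration only
labels = ["the", "sp", "spot", "<SIL>", "cat"]

# The anchors ^ and $ keep "sp" from matching inside "spot"
fave_matches = non_speech_labels(labels, r"^sp$")
timit_matches = non_speech_labels(labels, r"^<SIL>$")

print(fave_matches)
print(timit_matches)
```

Dropping the anchors (a pattern of just `sp`) would also match ordinary words such as "spot", which is why the patterns suggested above are anchored.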

Utterances¶

The primary function of encoding non-speech elements is to use them as the boundaries of utterances. In general, a pause between utterances is defined as a non-speech element (usually silence) longer than some minimum duration, usually 0.15 or 0.5 seconds.

To encode utterances:

Go to the “Enhance corpus” menu
Select the “Encode utterances...” option
Replace the default values if needed

The default is 0 (every non-speech element is a pause between utterances). Change it to 0.15 so that only non-speech elements longer than 150 ms count as pauses.

Press Encode and wait for it to finish
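The effect of the minimum pause duration can be sketched as follows. This is an illustrative sketch rather than SCT's implementation: the function name `split_utterances`, the tuple layout, and the sample timings are all assumptions.

```python
def split_utterances(annotations, min_pause=0.0):
    """Group speech annotations into utterances.

    annotations: (label, begin, end, is_speech) tuples in time order.
    A non-speech element longer than min_pause closes the current
    utterance; a shorter one is skipped without breaking it.
    """
    utterances, current = [], []
    for label, begin, end, is_speech in annotations:
        if is_speech:
            current.append(label)
        elif (end - begin) > min_pause and current:
            utterances.append(current)
            current = []
    if current:
        utterances.append(current)
    return utterances


# Hypothetical annotations: a 100 ms silence and a 600 ms silence
annotations = [
    ("so", 0.0, 0.2, True),
    ("sp", 0.2, 0.3, False),
    ("yeah", 0.3, 0.6, True),
    ("sp", 0.6, 1.2, False),
    ("okay", 1.2, 1.5, True),
]

print(split_utterances(annotations, min_pause=0.0))
print(split_utterances(annotations, min_pause=0.15))
```

With a threshold of 0 both silences split the speech into three utterances. With 0.15 only the longer silence counts as a pause, so "so" and "yeah" end up in the same utterance.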

Syllables¶

Syllables are encoded in two steps. First, the set of syllabic segments in the phonological inventory has to be specified.

To specify segments as syllabic:

Go to the “Enhance corpus” menu
Select the “Encode syllabic segments...” option
Change the default values as necessary

By default it selects segments that contain the characters i e a o u, which covers a number of machine-readable/non-IPA alphabets

Press Encode and wait for it to finish
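The default selection rule can be approximated as below. This is only a sketch of the idea: the inventory is a made-up sample of Buckeye-style labels, the function name `default_syllabics` is hypothetical, and the exact matching rules SCT uses (for example, how it treats upper-case labels) are not confirmed here.

```python
def default_syllabics(inventory, vowel_chars="ieaou"):
    """Return the segments whose label contains any of the vowel characters."""
    return [seg for seg in inventory if any(ch in vowel_chars for ch in seg)]


# Hypothetical inventory using lower-case, Buckeye-style labels
inventory = ["aa", "iy", "k", "t", "er", "el", "s", "uh", "ng"]

syllabic = default_syllabics(inventory)
print(syllabic)
```

Because the rule only looks at characters in the label, it is worth reviewing the selection and adjusting it for segments whose labels do not follow this convention.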

Once syllabic segments have been encoded, queries can filter on whether a phone belongs to the syllabic subset, and you can encode the syllables themselves.

To encode syllables:

Go to the “Enhance corpus” menu
Select the “Encode syllables...” option
Select the desired algorithm

At the moment only a “maximum attested onset” algorithm is implemented

This algorithm finds all the onsets at the beginnings of words

Any consonantal string between two vowels is split up in such a way that as many segments are put into the onset as possible given the attested onsets at the beginnings of words

Other algorithms will be implemented in the future

Press Encode and wait for it to finish
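The idea behind a maximum attested onset split can be sketched as follows. This is a simplified illustration under stated assumptions, not SCT's implementation: the function names, the toy word list, and the set of syllabic segments are all hypothetical.

```python
def attested_onsets(words, syllabics):
    """Collect the consonant sequences found before the first
    syllabic segment of each word."""
    onsets = set()
    for phones in words:
        onset = []
        for phone in phones:
            if phone in syllabics:
                onsets.add(tuple(onset))
                break
            onset.append(phone)
    return onsets


def split_cluster(cluster, onsets):
    """Split an intervocalic consonant cluster into (coda, onset),
    giving the onset as many segments as the attested onsets allow."""
    for i in range(len(cluster) + 1):
        if tuple(cluster[i:]) in onsets:
            return cluster[:i], cluster[i:]
    # No attested onset fits, not even the empty one
    return cluster, []


# Toy data, for illustration only
syllabics = {"aa", "iy", "ae"}
words = [["s", "t", "aa", "p"], ["t", "r", "iy"], ["k", "ae", "t"], ["ae", "t"]]

onsets = attested_onsets(words, syllabics)
print(split_cluster(["s", "t", "r"], onsets))
print(split_cluster(["n", "t"], onsets))
```

In this toy data "s t r" never begins a word but "t r" does, so the cluster is split into a coda "s" and an onset "t r". The cluster "n t" ends up entirely in the coda because neither "n t" nor "t" alone is attested word-initially here.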

Hierarchical properties¶

Useful information is available once the hierarchy has been fleshed out beyond words and phones. For instance, once utterances and syllables are encoded, you can count all of the syllables in each utterance, or get the rate of them per second (a common definition of speech rate). These properties are useful to cache before queries because their calculation is time intensive, but the results do not change. An utterance, once encoded, will always have the same number of syllables in it.

To encode a hierarchical property:

Go to the “Enhance corpus” menu
Select the “Encode hierarchical properties...” option
Select the higher annotation

For speech rate, this would be utterance

For number of syllables in a word, this would be word

For a word’s position in its utterance, this would be utterance

Select the lower annotation

For both speech rate and number of syllables in a word, this would be syllable

For a word’s position in its utterance, this would be word

Select the type of property

For speech rate, this would be rate

For number of syllables in a word, this would be count

For a word’s position in its utterance, this would be position

Enter a name for the property

The default name is intended to be descriptive, but it can be overly long, so you may want to shorten it

Press Encode and wait for it to finish
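The three property types amount to simple calculations over the lower annotations contained in a higher one. The sketch below illustrates them with hypothetical function names and data. Whether SCT numbers positions from 1, as assumed here, is not stated in this tutorial.

```python
def count_property(lower_units):
    """Number of lower annotations inside the higher annotation."""
    return len(lower_units)


def rate_property(lower_units, begin, end):
    """Lower annotations per second over the higher annotation's duration."""
    duration = end - begin
    if duration <= 0:
        raise ValueError("the higher annotation must have a positive duration")
    return len(lower_units) / duration


def position_property(lower_units):
    """Pair each lower annotation with its position (assumed 1-based)."""
    return [(unit, i) for i, unit in enumerate(lower_units, start=1)]


# Hypothetical utterance lasting 2 seconds and containing 6 syllables
syllables = ["dhis", "ih", "z", "ah", "teh", "st"]
words = ["this", "is", "a", "test"]

speech_rate = rate_property(syllables, begin=0.0, end=2.0)
num_syllables = count_property(syllables)
word_positions = position_property(words)
print(speech_rate, num_syllables, word_positions)
```

None of these values change once the hierarchy is encoded, which is why storing them ahead of time is cheaper than recomputing them for every query.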

Enriching the lexicon¶

Often we would like to query based on properties of words gathered from outside the corpus itself. For instance, part of speech is often not encoded in corpora when they’re imported, but could be a criterion to search for or to exclude. Likewise, if a particular set of words is needed, they can be encoded with a property offline to facilitate queries later.

Lexicon enrichment requires a column-delimited text file (CSV, tab-delimited text file, etc.) with a header row. The first column should be the orthography of the word; the name of this column is not used. Subsequent columns correspond to properties to be encoded, where the sanitized name of the column is used as the name of the property in the database. For instance, a column named “Frequency” containing numerical values will become a numeric property named “Frequency” that can be filtered on.

The words specified in the text file do not have to be exhaustive. Properties are set for each word that is found, and the other words are left alone. If you have a specific set of words you’d like to search for, you can create a text file with the first column having the orthography, and the second column a property named “Desired” with every word having a corresponding “True” value in that column. Then you can do a search for every word that has a value of True for its Desired property.
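As an illustration of the file format, the sketch below parses a small enrichment file with Python's `csv` module and applies it to a toy lexicon. The file contents, the function name `read_enrichment`, and the toy lexicon are hypothetical, and values are kept as strings, so this does not reproduce SCT's own type detection or column-name sanitization.

```python
import csv
import io

# Hypothetical enrichment file: first column is the orthography,
# the remaining columns are properties to encode
lexicon_csv = """Word,Frequency,Desired
the,1000,True
cat,25,True
"""


def read_enrichment(text):
    """Map each word to a dict of {property name: value}."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    property_names = header[1:]  # the first column's name is not used
    enrichment = {}
    for row in reader:
        if row:
            enrichment[row[0]] = dict(zip(property_names, row[1:]))
    return enrichment


enrichment = read_enrichment(lexicon_csv)

# Toy lexicon: only words that appear in it receive properties
corpus_words = {"the": {}, "dog": {}}
for word, properties in enrichment.items():
    if word in corpus_words:
        corpus_words[word].update(properties)

print(corpus_words)
```

Here "the" receives both properties, "dog" is left untouched because it is not in the file, and "cat" is ignored because it is not in the toy lexicon.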

To enrich the lexicon:

Go to the “Enhance corpus” menu
Select the “Encode lexicon...” option
If you would like matching to be case-sensitive, check the corresponding box
Press “Encode”, select a text file on your computer, and wait for it to finish

Enriching the phonological inventory¶

As with the lexicon, it is often useful to enrich the phonological inventory of a corpus with properties of segments. These can be features, such as a value of + for a feature named anterior, or descriptive values, such as fricative for a property named manner_of_articulation.

The format of files that are used for inventory enrichment mirrors that for lexicon enrichment. They should be column-delimited text files with headers where the first column corresponds to the segment label and subsequent columns are properties to be encoded on the segments.

Go to the “Enhance corpus” menu
Select the “Encode phonological inventory...” option
Press “Encode”, select a text file on your computer, and wait for it to finish

Encode phone subsets/classes¶

You can encode some arbitrary subset of phones as a particular label, similar to how syllabic segments were encoded with the subset label of syllabic.

Go to the “Enhance corpus” menu
Select the “Encode phone subsets (classes)...” option
Enter a label for the subset/class
Select the phones to be classified
Press Encode and wait for it to finish

Analyze acoustics¶

Acoustics (pitch and formants) can be encoded to enrich the corpus. At the moment, such encoding is only relevant when inspecting the waveform/spectrogram, as there is currently no way to query acoustics. Encoding acoustics can also take a while, depending on the size of the sound files and the corpus, so I do not recommend using this option in the current state of SCT.
