Distributed Representations and Mixed Schemas

Catherine L. Harris Boston University


In this talk, I introduce the notion of a mixed schema: a representation containing both a linguistic category (such as noun) and an overtly occurring expression (a word, such as first). An example of a mixed schema is first+noun. Patterns which instantiate this schema include first time, first name, first place, first lady, and first man to walk on the moon. I discuss how mixed schemas facilitate word and concept learning and may also underlie creative uses of language. Applied to the domain of familiar two-word combinations, a connectionist model shows how mixed schemas emerge when a system self-organizes in the course of extracting regularities from a corpus of utterances. Measurements of the representational strength of two-word patterns in the network are compared to recognition data from human subjects.

I focus on familiar two-word combinations because their pattern strength can be quantified via text counts, and language users' reactions to them can be measured via recognition experiments. However, schemas of intermediate abstraction are to be expected in virtually all areas where humans acquire knowledge via regularity extraction.

Mixed Schemas, the Schematicity Continuum, and the Rule-List Fallacy

The Rule-List Fallacy (so named by Langacker, 1987) is the assumption that there are two types of mental representations: abstract descriptions of patterns (such as the descriptions of rules in a grammar), and lists of exceptions. On this assumption, a phrase or sentence which fits a rule would not also be separately memorized. The division into rules and lists of exceptions was originally designed to prevent redundant encoding and achieve economical description (Chomsky, 1965). It makes less sense today, given the modern recognition that the brain has vast storage resources and may use massively redundant encodings.

Evidence against the division into rules and lists includes examples where the meaning of a sentence has probably been stored as a memorized entity, yet is not an exception to the rules of compositional semantics, as in She felt the baby kick.

The alternative to dividing linguistic units into generalizations over phrases and listings of phrases is the schematicity continuum (see also the proponents of construction grammar; Fillmore, 1988; Goldberg, 1992; Harris, 1994). This is the proposal that generalizations over linguistic patterns are not limited to the maximally general level of linguistic categories, but occur at many levels of specificity and with varying degrees of productivity.

The schematicity continuum only makes sense in a system with dynamic data structures. Data structures are dynamic when they are not a fixed part of the overall system, but are emergent, or implicit in the workings of the system. For example, consider the lose+noun pattern, instantiated by lose track, lose sight, and lose touch. We can identify the lose+noun generalization by applying statistical analysis tools to the hidden units of a simple recurrent network, but the generalization does not exist independently of the representations of its members.

The foregoing reminds us that there is nothing special or marked about mixed schemas within the schematicity continuum. The reason to focus on them is that they provide a concrete case where rival theories make different predictions. For convenience I will refer to the two rival theories as the "multi-schemas" view and the "rule+list" view.

Linguistic Questions

Important questions in linguistics and language learning are what principles govern the ease of learning a generalization R and how easily R can be creatively extended. The connectionist framework used here suggests the following factors:

Experiment Predictions

Experimental participants recognize familiar word combinations (plaid skirt) more easily than merely legal combinations (green skirt), and legal combinations more easily than anomalous combinations (Harris, 1997). In the work reported in this paper, these same techniques have been applied to phrases which can be assimilated both to an adjective+noun schema and to a mixed schema, such as first+noun.

According to the multi-schemas view, speed of recognition of a phrase is influenced both by the phrase's frequency and by its schema's frequency, where the schema can be any pattern less specific than the phrase. Consider the phrase high rule. Assuming that this phrase is unfamiliar, its recognition will be facilitated by the strength of mixed schemas such as high+noun and also by the strength of the general adjective+noun schema. According to the rule+list view, only the adjective+noun schema will be relevant.

These predictions can be tested using unfamiliar phrases which can be assimilated to either a low or a high frequency schema. For example, text counts reveal fewer instances of low+noun phrases than of high+noun phrases. The multi-schemas view thus predicts better recognition of phrases such as high bicycle than of phrases such as low bicycle, while the rule+list view predicts no difference.
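As a rough illustration of how such text counts could be gathered, the following Python sketch tallies two-word patterns at three levels of schematicity: the specific phrase, the mixed schema (word+category), and the general schema (category+category). The tiny tagged corpus, the tag names, and the function schema_counts are hypothetical and invented for this example; they are not the counts or tools used in the work described here.

```python
from collections import Counter

# Hypothetical toy corpus: each utterance is a list of (word, category) pairs.
# The words and tags are invented for illustration only.
UTTERANCES = [
    [("the", "DET"), ("high", "ADJ"), ("school", "NOUN")],
    [("a", "DET"), ("high", "ADJ"), ("price", "NOUN")],
    [("a", "DET"), ("low", "ADJ"), ("price", "NOUN")],
    [("the", "DET"), ("high", "ADJ"), ("school", "NOUN")],
    [("a", "DET"), ("green", "ADJ"), ("skirt", "NOUN")],
]


def schema_counts(utterances):
    """Count adjacent word pairs at three levels of schematicity."""
    phrase = Counter()   # fully specific pattern, e.g. high school
    mixed = Counter()    # mixed schema, e.g. high+NOUN
    general = Counter()  # general schema, e.g. ADJ+NOUN
    for utterance in utterances:
        # Pair each token with its successor inside the same utterance.
        for (w1, t1), (w2, t2) in zip(utterance, utterance[1:]):
            phrase[(w1, w2)] += 1
            mixed[(w1, t2)] += 1
            general[(t1, t2)] += 1
    return phrase, mixed, general


phrase, mixed, general = schema_counts(UTTERANCES)
print("high+NOUN tokens:", mixed[("high", "NOUN")])
print("low+NOUN tokens:", mixed[("low", "NOUN")])
print("ADJ+NOUN tokens:", general[("ADJ", "NOUN")])
```

In this toy corpus high+NOUN has more tokens than low+NOUN, which is the kind of asymmetry the multi-schemas view expects to matter for recognition and the rule+list view does not.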

Advocates of the rule+list view could claim that high is a better adjective than low: that a feature of words is their adjectival goodness, and that adjectival goodness speeds assimilation to the adjective+noun schema. To address this objection, phrases were collected in which the initial word could be used as either a noun or an adjective (fire town vs fire around).

Design of Simulations

Familiar word combinations were selected from Lund & Burgess (1996) and embedded in sentences generated by a phrase structure grammar, thus creating a corpus of language-like sequences. A simple recurrent network was used to predict the next word in the sentence, following Elman (1993).
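A minimal sketch of next-word prediction in a simple recurrent network is given below, assuming NumPy is available. It shows only the forward pass of an untrained network: the hidden state is computed from the current word and a copy of the previous hidden state, and the error on each following word is read off the output distribution. The vocabulary, layer sizes, tanh hidden units, softmax output, and negative log probability error are all assumptions made for illustration, not details of the simulations described here.

```python
import numpy as np

# Hypothetical toy vocabulary, invented for this sketch.
VOCAB = ["<s>", "she", "lost", "track", "sight", "touch"]
IDX = {word: i for i, word in enumerate(VOCAB)}


class SimpleRecurrentNetwork:
    """Forward pass of an Elman-style network (untrained, random weights)."""

    def __init__(self, vocab_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        self.hidden_size = hidden_size
        self.W_xh = rng.normal(0.0, 0.1, (hidden_size, vocab_size))   # input -> hidden
        self.W_hh = rng.normal(0.0, 0.1, (hidden_size, hidden_size))  # context -> hidden
        self.W_hy = rng.normal(0.0, 0.1, (vocab_size, hidden_size))   # hidden -> output

    def step(self, word_index, context):
        """Return (next-word probabilities, new hidden state) for one word."""
        x = np.zeros(self.W_xh.shape[1])
        x[word_index] = 1.0  # one-hot encoding of the current word
        hidden = np.tanh(self.W_xh @ x + self.W_hh @ context)
        logits = self.W_hy @ hidden
        probs = np.exp(logits - logits.max())  # numerically stable softmax
        probs /= probs.sum()
        return probs, hidden

    def prediction_errors(self, indices):
        """Negative log probability assigned to each word that actually follows."""
        context = np.zeros(self.hidden_size)
        errors = []
        for current, following in zip(indices, indices[1:]):
            probs, context = self.step(current, context)
            errors.append(float(-np.log(probs[following])))
        return errors


net = SimpleRecurrentNetwork(len(VOCAB), hidden_size=8)
sentence = [IDX[w] for w in ["<s>", "she", "lost", "track"]]
print(net.prediction_errors(sentence))
```

With training (omitted here), the last value in the list would correspond to the error generated by the second word of the phrase lost track.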

To investigate learning and generalization, different training corpora were constructed in which the frequency and diversity of two-word patterns were systematically controlled. (In the initial simulations, diversity was investigated by varying type-token ratios; in later planned simulations, semantics will be used so that diversity will refer to conceptual domains.) Ease of learning was measured as the number of training cycles needed until the identity and/or grammatical category of the second word of a phrase could be predicted with low error. The networks' reaction to novel phrases such as fire town and fire around was measured by the amount of error generated by the second word.
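One way to read the type-token manipulation is sketched below in Python: for a mixed schema such as first+noun, the ratio is the number of distinct second words divided by the number of phrase tokens that instantiate the schema. This definition, the function name type_token_ratio, and the two toy corpora are assumptions for illustration; the exact computation used in the simulations is not specified here.

```python
def type_token_ratio(phrases, first_word):
    """Distinct second words divided by phrase tokens for one first word."""
    second_words = [second for first, second in phrases if first == first_word]
    if not second_words:
        return 0.0  # the schema is not instantiated in this corpus
    return len(set(second_words)) / len(second_words)


# Two hypothetical corpora with the same schema frequency (four tokens each)
# but different diversity of the second word.
low_diversity = [("first", "time")] * 4
high_diversity = [
    ("first", "time"),
    ("first", "name"),
    ("first", "place"),
    ("first", "lady"),
]

print(type_token_ratio(low_diversity, "first"))
print(type_token_ratio(high_diversity, "first"))
```

Holding token frequency constant while varying the ratio, as in these two corpora, separates the effect of schema frequency from the effect of schema diversity.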