Coh-Metrix version 2.0 indices

 

Table of contents

I. General overview
II. Overview of Coh-Metrix indices (output file)

III. Indices in the Coh-Metrix 2.0 output file

1. General identification and reference information

2. Readability indices

3. General word information and text information

4. Syntactic indices

5. Referential and semantic indices

6. Situation model dimensions

IV. References

V. Example Text and Output

 

 

 

I. General Overview

 

Coh-Metrix is a computational tool that produces indices of the linguistic and discourse representations of a text. These values can be used in many different ways to investigate the cohesion of the explicit text and the coherence of the mental representation of the text. Our definition of cohesion consists of characteristics of the explicit text that play some role in helping the reader mentally connect ideas in the text (Graesser, McNamara, & Louwerse, 2003). The definition of coherence is the subject of much debate. Theoretically, the coherence of a text is defined by the interaction between linguistic representations and knowledge representations. When we put the spotlight on the text, however, coherence can be defined as characteristics of the text (i.e., aspects of cohesion) that are likely to contribute to the coherence of the mental representation.  Coh-Metrix provides indices of such cohesion characteristics.

   

 

1. Preliminary information

 

This document contains a description of 60 indices incorporated into the website Coh-Metrix version 2.0. These descriptions are intended to be succinct specifications for people who want to work on this version of Coh-Metrix. More theoretical information on the Coh-Metrix indices and architecture is reported in Graesser, McNamara, Louwerse and Cai (2004) and McNamara, Louwerse and Graesser (2002). For each level of indices, example scores are provided.  However, it is important to note that even small changes in texts can lead to large changes in Coh-Metrix output. It is also important to note that the scores are often subject to the output of third party parsers, lexicons, and word frequency databases, all of which are outside of the control of Coh-Metrix.

 

2. Coh-Metrix concepts

 

Some definitions of key concepts are needed to specify the algorithms that underlie the indices in Coh-Metrix 2.0.  We define these concepts in this section.

 

Adjacent versus All sentences

 

Adjacent sentences are successive sentences in a span of text.  For example, if a span of text has 4 sentences, then the adjacent sentences would be sentences 1-2, 2-3, and 3-4. In contrast, all sentences are all possible pairs of sentences: 1-2, 2-3, 3-4, 1-3, 1-4, and 2-4. A span of text may be defined in different ways for different purposes, but there is a distinction between paragraph spans and the span of an entire document. The adjacent sentences in Coh-Metrix 2.0 ignore junctures between paragraphs. 
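The distinction between adjacent and all sentence pairs can be illustrated with a minimal sketch. The function names below are hypothetical and are not part of Coh-Metrix; the sketch simply enumerates the pairs for a span of 4 sentences, as in the example above.

```python
from itertools import combinations

def adjacent_pairs(n_sentences):
    """Pairs of successive sentences, numbered from 1 (hypothetical helper)."""
    return [(i, i + 1) for i in range(1, n_sentences)]

def all_pairs(n_sentences):
    """Every unordered pair of sentences, numbered from 1 (hypothetical helper)."""
    return list(combinations(range(1, n_sentences + 1), 2))

print(adjacent_pairs(4))  # [(1, 2), (2, 3), (3, 4)]
print(all_pairs(4))       # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```

A span of N sentences has N-1 adjacent pairs and N(N-1)/2 pairs in total.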

 

Weighted versus Unweighted distances between sentences

 

This distinction is pervasive in a more advanced version of Coh-Metrix (1.2), but not in most indices used in Coh-Metrix 2.0. When the distance between sentences is weighted, the weight between two sentences decreases the further they are apart in the text. All sentence pairs have equal weight when distances are unweighted. With rare exception, distances between sentences are unweighted in Coh-Metrix 2.0.

 

Incidence scores versus Ratio scores

 

An incidence score is the number of classified units per 1000 words. For example, the incidence score for pronouns would compute the number of words that are classified as pronouns for a span of 1000 words. It is equivalent to what some researchers call rates or density scores. In contrast, a ratio score is a relative measure that compares the incidence of one class of units to the incidence of another class of units. For example, a pronoun ratio is the incidence of pronouns divided by the incidence of noun-phrases. Ratio scores compare two different metrics (classes of units) whereas an incidence score applies to only one metric.
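The arithmetic behind incidence and ratio scores can be sketched as follows. The function names and the counts are illustrative assumptions, not Coh-Metrix code; in practice the counts would come from the parser.

```python
def incidence(count, n_words):
    """Number of classified units per 1000 words."""
    return 1000.0 * count / n_words

def ratio(incidence_a, incidence_b):
    """Incidence of one class of units divided by the incidence of another."""
    return incidence_a / incidence_b

# Hypothetical counts: 1 pronoun and 8 noun-phrases in a 25-word span.
pronoun_incidence = incidence(1, 25)   # 40.0
np_incidence = incidence(8, 25)        # 320.0
print(ratio(pronoun_incidence, np_incidence))  # 0.125
```

Because both incidence scores are computed over the same span, the ratio reduces to the ratio of the raw counts.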

 

Repetition score

 

A repetition score is computed on sequences of text units that are classified into categories.  This score is the proportion of adjacent pairs of units in the sequence that are in the same category.  If there are N units in a sequence, there are (N-1) adjacent pairs.  The number of adjacent pairs in the same category is divided by N-1. For example, we have computed the repetition score for a sequence of categories A, B, and C:

 

Category sequence:    A B B B C A A C C B B B B A C C

Adjacency repetition: 0 1 1 0 0 1 0 1 0 1 1 1 0 0 1

The repetition score for this sequence is 8 / 15.
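The computation can be checked with a short sketch. The function name is hypothetical; the sequence is the 16-unit category sequence that corresponds to the 15 adjacency values listed above.

```python
def repetition_score(categories):
    """Proportion of adjacent pairs of units that fall in the same category."""
    pairs = list(zip(categories, categories[1:]))
    if not pairs:
        raise ValueError("need at least two units to form an adjacent pair")
    matches = sum(1 for first, second in pairs if first == second)
    return matches / len(pairs)

sequence = "A B B B C A A C C B B B B A C C".split()
print(len(sequence) - 1)           # 15 adjacent pairs
print(repetition_score(sequence))  # 8 / 15, approximately 0.533
```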

 

 


II. Overview of Coh-Metrix 2.0 Indices

 

The table of indices in the Coh-Metrix 2.0 output file has four columns. The columns are: (1) the ordinal number of the index, (2) a short description or label for the index, (3) a coded index that has a maximum of 8 characters and that assists in reconstructing relevant distinctions and parameters in some software packages (e.g., SPSS), and (4) an expanded full description of the index. Scores for these indices will appear between the 3rd and 4th column. Separate columns will be added for each additional text. Details about an index can be accessed by clicking on its label in the Description column.

 

Table 1: Indices in the Coh-Metrix 2.0 Output File

 

No. | Description | Measure | Full description
1 | Title | Title | Title
2 | Genre | Genre | Genre
3 | Source | Source | Source
4 | JobCode | JobCode | JobCode
5 | LSASpace | LSASpace | LSASpace
6 | Date | Date | Date
7 | Adjacent anaphor reference | CREFP1u | Anaphor reference, adjacent, unweighted
8 | Anaphor reference | CREFPau | Anaphor reference, all distances, unweighted
9 | Adjacent argument overlap | CREFA1u | Argument Overlap, adjacent, unweighted
10 | Argument overlap | CREFAau | Argument Overlap, all distances, unweighted
11 | Adjacent stem overlap | CREFS1u | Stem Overlap, adjacent, unweighted
12 | Stem overlap | CREFSau | Stem Overlap, all distances, unweighted
13 | Content word overlap | CREFC1u | Proportion of content words that overlap between adjacent sentences
14 | LSA sentence adjacent | LSAassa | LSA, Sentence to Sentence, adjacent, mean
15 | LSA sentence all | LSApssa | LSA, sentences, all combinations, mean
16 | LSA paragraph | LSAppa | LSA, Paragraph to Paragraph, mean
17 | Personal pronouns | DENPRPi | Personal pronoun incidence score
18 | Pronoun ratio | DENSPR2 | Ratio of pronouns to noun phrases
19 | Type-token ratio | TYPTOKc | Type-token ratio for all content words
20 | Causal content | CAUSVP | Incidence of causal verbs, links, and particles
21 | Causal cohesion | CAUSC | Ratio of causal particles to causal verbs (cp divided by cv+1)
22 | Intentional content | INTEi | Incidence of intentional actions, events, and particles
23 | Intentional cohesion | INTEC | Ratio of intentional particles to intentional content
24 | Syntactic structure similarity adjacent | STRUTa | Sentence syntax similarity, adjacent
25 | Syntactic structure similarity all 1 | STRUTt | Sentence syntax similarity, all, across paragraphs
26 | Syntactic structure similarity all 2 | STRUTp | Sentence syntax similarity, sentence all, within paragraphs
27 | Temporal cohesion | TEMPta | Mean of tense and aspect repetition scores
28 | Spatial cohesion | SPATC | Mean of location and motion ratio scores
29 | All connectives | CONi | Incidence of all connectives
30 | Conditional operators | DENCONDi | Number of conditional expressions, incidence score
31 | Pos. additive connectives | CONADpi | Incidence of positive additive connectives
32 | Pos. temporal connectives | CONTPpi | Incidence of positive temporal connectives
33 | Pos. causal connectives | CONCSpi | Incidence of positive causal connectives
34 | Pos. logical connectives | CONLGpi | Incidence of positive logical connectives
35 | Neg. additive connectives | CONADni | Incidence of negative additive connectives
36 | Neg. temporal connectives | CONTPni | Incidence of negative temporal connectives
37 | Neg. causal connectives | CONCSni | Incidence of negative causal connectives
38 | Neg. logical connectives | CONLGni | Incidence of negative logical connectives
39 | Logic operators | DENLOGi | Logical operator incidence score (and + if + or + cond + neg)
40 | Raw freq. content words | FRQCRacw | Celex, raw, mean for content words (0-1,000,000)
41 | Log freq. content words | FRQCLacw | Celex, logarithm, mean for content words (0-6)
42 | Min. raw freq. content words | FRQCRmcs | Celex, raw, minimum in sentence for content words (0-1,000,000)
43 | Log min. freq. content words | FRQCLmcs | Celex, logarithm, minimum in sentence for content words (0-6)
44 | Concreteness content words | WORDCacw | Concreteness, mean for content words
45 | Min. concreteness content words | WORDCmcs | Concreteness, minimum in sentence for content words
46 | Noun hypernym | HYNOUNaw | Mean hypernym values of nouns
47 | Verb hypernym | HYVERBaw | Mean hypernym values of verbs
48 | Negations | DENNEGi | Number of negations, incidence score
49 | NP incidence | DENSNP | Noun Phrase Incidence Score (per thousand words)
50 | Modifiers per NP | SYNNP | Mean number of modifiers per noun-phrase
51 | Higher level constituents | SYNHw | Mean number of higher level constituents per word
52 | Words before main verb | SYNLE | Mean number of words before the main verb of main clause in sentences
53 | No. of words | READNW | Number of Words
54 | No. of sentences | READNS | Number of Sentences
55 | No. of paragraphs | READNP | Number of Paragraphs
56 | Syllables per word | READASW | Average Syllables per Word
57 | Words per sentence | READASL | Average Words per Sentence
58 | Sentences per paragraph | READAPL | Average Sentences per Paragraph
59 | Flesch Reading Ease | READFRE | Flesch Reading Ease Score (0-100)
60 | Flesch-Kincaid | READFKGL | Flesch-Kincaid Grade Level (0-12)

 

 

III. Indices in the Coh-Metrix 2.0 output file

 

The indices in Coh-Metrix 2.0 are categorized into six groups: (1) general identification and reference information, (2) readability indices, (3) general word and text information, (4) syntactic indices, (5) referential and semantic indices, and (6) situational model dimensions.

 

 

1. General identification and reference information (indices 1-6)

 

Indices 1-6 are used for identification and reference purposes.

 

 

Title (index 01)

 

The title of the text is provided by the user (e.g., How plants grow). The title is used for reference and identification purposes only.

 

 

Genre (index 02)

 

The genre is the general category of the text and is chosen by the user (e.g., science, narrative, or informational). The genre is used for reference and identification purposes only.

 

 

Source (index 03)

 

The source identifies where the text appeared or was published, and is provided by the user (e.g., IIS publications, 2005). The source is used for reference and identification purposes only.

 

 

JobCode (index 04)

 

The Job ID (e.g., JohnExp1) identifies the job and is provided by the user. The Job ID is used for both reference and identification purposes. It is strongly suggested that researchers keep a record of the JobCode in order to locate results in the future.

 

 

LSA Space (index 05)

 

The user has the option of choosing from five Latent Semantic Analysis (LSA) spaces: College Level, Science, Narrative, Encyclopedia, and Physics. The LSA space determines the world knowledge or conceptual domain that is used for LSA comparisons. The College Level space is based on the TASA (Touchstone Applied Science Associates, Inc.) corpus that has text files covering novels, newspaper articles, and other information. Science, Narrative, Encyclopedia and Physics spaces are based on large corpora of documents from these genres. The choice of LSA space can have important consequences for LSA scores. Users who do not feel confident about their choice are advised to select the College Level LSA space.

 

 

Date (index 06)

 

The date is provided automatically and serves to help researchers retrieve their data.

 

 

2. Readability indices

 

The traditional method of assessing text difficulty relies on readability formulas. More than 40 readability formulas have been developed over the years (Klare, 1974-1975). The most common formulas are the Flesch Reading Ease Score and the Flesch-Kincaid Grade Level.

 

 

Flesch Reading Ease: READFRE (index 59)

 

The output of the Flesch Reading Ease formula is a number from 0 to 100, with a higher score indicating easier reading. The average document has a Flesch Reading Ease score between 60 and 70. The formula is provided below:

 

READFRE = 206.835 - (1.015 x ASL) - (84.6 x ASW)

 

where:

 

ASL = average sentence length = the number of words divided by the number of sentences. This is the same as READASL.


ASW = average number of syllables per word = the number of syllables divided by the number of words. Syllable counts come from the CELEX database. This is the same as READASW.

 

 

Flesch-Kincaid Grade Level: READFKGL (index 60)

 

This more common Flesch-Kincaid Grade Level formula converts the Reading Ease Score to a U.S. grade-school level. The higher the number, the harder it is to read the text. The grade levels range from 0 to 12.

 

READFKGL = (.39 x ASL) + (11.8 x ASW) - 15.59

 

In general, a text should have more than 200 words before the Flesch Reading Ease and Flesch-Kincaid Grade Level scores can be applied successfully.
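The two formulas can be sketched directly from the definitions of ASL and ASW given above. The function names and the word, sentence, and syllable counts are hypothetical; Coh-Metrix takes its syllable counts from CELEX, which this sketch does not attempt to reproduce, and the results are not clamped to the stated ranges.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """READFRE = 206.835 - (1.015 x ASL) - (84.6 x ASW)."""
    asl = n_words / n_sentences   # average sentence length (READASL)
    asw = n_syllables / n_words   # average syllables per word (READASW)
    return 206.835 - 1.015 * asl - 84.6 * asw

def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """READFKGL = (.39 x ASL) + (11.8 x ASW) - 15.59."""
    asl = n_words / n_sentences
    asw = n_syllables / n_words
    return 0.39 * asl + 11.8 * asw - 15.59

# Hypothetical text: 300 words, 20 sentences, 450 syllables (ASL = 15, ASW = 1.5).
print(round(flesch_reading_ease(300, 20, 450), 2))   # 64.71
print(round(flesch_kincaid_grade(300, 20, 450), 2))  # 7.96
```

Both scores depend only on ASL and ASW, so longer sentences and longer words lower the Reading Ease score and raise the Grade Level.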

 

3. General Word and Text Information

 

The general word and text information includes incidence scores on word and text units (i.e., number of occurrences per 1000 words).  It also includes the mean values of characteristics of content words, such as frequency of usage in the English language and concreteness.  

 

3.1 Basic count

 

Number of words: READNW (index 53)

 

This is the number of words in the entire text.

 

 

Number of sentences: READNS (index 54)

 

This is the number of sentences in the entire text.

 

 

Number of paragraphs: READNP (index 55)

 

This is the number of paragraphs in the entire text. Paragraphs are counted as hard returns, not indents. Some texts, for example, have eight spaces to mark the beginning of a paragraph; however, unless these spaces are preceded by a hard return, Coh-Metrix does not identify this as a paragraph. Whenever two successive hard returns are entered, only one of them will be counted as a paragraph, the one that has characters following it.  Lists of words also count as paragraphs if hard returns are used.

 

 

Syllables per word: READASW (index 56)

 

This is the mean number of syllables per content word, a ratio measure.

 

 

Words per sentence: READASL (index 57)

 

This is the mean number of words per sentence.

 

 

Sentences per paragraph: READAPL (index 58)

 

This is the mean number of sentences per paragraph.

 

3.2 Frequencies

 

 

Raw frequency of content words: FRQCRacw (index 40)

 

This is the mean raw frequency of all of the content words in the text. Content words are nouns, adverbs, adjectives, main verbs, and other categories with rich conceptual content.

 

 

Log frequency of content words: FRQCLacw (index 41)

 

This is the mean log frequency of all content words in the text. Content words are nouns, adverbs, adjectives, main verbs, and other categories with rich conceptual content. Taking the log of the frequencies rather than the raw scores is compatible with research on reading time (Haberlandt & Graesser, 1985; Just & Carpenter, 1980).

 

Min. raw frequency of content words: FRQCRmcs (index 42)

 

This initially computes the lowest frequency score among all of the content words in each sentence. A mean of these minimum frequency scores is then computed. Content words are nouns, adverbs, adjectives, main verbs, and other categories with rich conceptual content. The word with the lowest frequency score is the rarest word in the sentence. Scores range from 0 to 1,000,000.

 

Log min. frequency of content words: FRQCLmcs (index 43)

 

This initially computes the lowest log frequency score among all of the content words in each sentence. A mean of these minimum log frequency scores is then computed. The logarithm is to the base 10. Content words are nouns, adverbs, adjectives, main verbs, and other categories with rich conceptual content. The word with the lowest log frequency score is the rarest word in the sentence. Scores range from 0 to 6.
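The two-step computation (minimum within each sentence, then mean across sentences) can be sketched as follows. The function name, the frequency table, and the treatment of unlisted or zero-frequency words (they are skipped) are assumptions made for illustration; they are not taken from Coh-Metrix or CELEX.

```python
import math

def mean_min_log_frequency(sentences, freq):
    """Mean over sentences of the lowest log10 frequency among content words.

    `sentences` is a list of lists of content words. Words that are missing
    from `freq` or have zero frequency are skipped (an assumption).
    """
    minima = []
    for words in sentences:
        logs = [math.log10(freq[w]) for w in words if freq.get(w, 0) > 0]
        if logs:
            minima.append(min(logs))
    if not minima:
        raise ValueError("no content word with a known frequency")
    return sum(minima) / len(minima)

# Hypothetical frequencies, not real CELEX values.
freq = {"cell": 100, "division": 10, "occurs": 1000, "replace": 100, "cells": 1000}
sentences = [["cell", "division", "occurs"], ["replace", "cells"]]
print(mean_min_log_frequency(sentences, freq))  # 1.5
```

In this toy input the rarest word in the first sentence is division (log frequency 1) and in the second sentence replace (log frequency 2), so the mean of the minima is 1.5.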

 

3.3 Concreteness

 

Coh-Metrix 2.0 makes use of the MRC Psycholinguistics Database (Coltheart, 1981), which scales samples of words on particular characteristics. The MRC Psycholinguistics Database contains 150,837 words and provides information on up to 26 different linguistic properties of these words. Most MRC indices are based on psycholinguistic experiments conducted by different researchers, so the coverage of words differs among the indices. Coh-Metrix 2.0 uses the MRC concreteness ratings for a large sample of content words. Concreteness measures how concrete a word is, based on human ratings. High numbers lean toward concrete and low numbers toward abstract. Values vary between 100 and 700.

 

 

Concreteness of content words: WORDCacw (index 44)

 

This is the mean concreteness value of all content words in a text that match a word in the MRC database.

 

 

Minimum concreteness of content words: WORDCmcs (index 45)

 

For each sentence in the text, the content word with the lowest concreteness rating is identified. This score is the mean of these minimum concreteness ratings across sentences.

3.4 Hypernymy

A word is abstract when it has few distinctive features and few attributes that can be pictured in the mind. One way of measuring the abstractness of a word is by the hypernym values in WordNet (Fellbaum, 1998; Miller et al., 1990).

WordNet is an online lexical reference system, the design of which is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into semantic fields of underlying lexical concepts. Some sets of words are functionally synonymous because they have the same or a very similar meaning. There are also relations between synonym sets. In particular, a hypernym metric is the number of levels in a conceptual taxonomic hierarchy above (superordinate to) a word. For example, chair (as a seat) has 7 hypernym levels: seat -> furniture -> furnishings -> instrumentality -> artifact -> object -> entity. A word having more hypernym levels is more concrete. A word with fewer hypernym levels is more abstract.
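The hypernym count can be illustrated with a toy hierarchy. The parent map below is hand-built from the chair example above and is only an assumption for illustration; it does not query WordNet, where a word may have several senses and several hypernym paths.

```python
# Hand-built fragment of a taxonomy, taken from the chair example (not WordNet itself).
PARENT = {
    "chair": "seat",
    "seat": "furniture",
    "furniture": "furnishings",
    "furnishings": "instrumentality",
    "instrumentality": "artifact",
    "artifact": "object",
    "object": "entity",
}

def hypernym_levels(word, parent=PARENT):
    """Count the levels above `word` by walking up to the root of the hierarchy."""
    levels = 0
    while word in parent:
        word = parent[word]
        levels += 1
    return levels

print(hypernym_levels("chair"))     # 7
print(hypernym_levels("artifact"))  # 2
```

In this fragment chair sits 7 levels below the root and artifact only 2, which matches the idea that words with fewer hypernym levels are more abstract.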

 

Noun hypernym: HYNOUNaw (index 46)

 

This is the mean hypernym value of nouns in the text.

 

 

Verb hypernym: HYVERBaw (index 47)

 

This is the mean hypernym value of main verbs in the text.

 

4. Syntax Indices

 

Syntactic indices (abbreviated as SYN) include a number of metrics that assess syntactic complexity, syntactic composition, and the frequency of particular syntactic classes or constituents in a text. Sentences with difficult syntactic composition are structurally dense, are syntactically ambiguous, have many embedded constituents, or are ungrammatical. The syntactic analyses are based on the Charniak syntactic parser. There are over 50 parts of speech, which are segregated into content and function words. When a word can be assigned to more than one part of speech (POS) category, the most likely category is assigned on the basis of its syntactic context. Moreover, the syntactic context can assign the most likely POS for words it doesn't know. In addition to POS, Coh-Metrix computes the number of noun-phrase (NP) constituents or number of verb-phrase (VP) constituents per 1000 words.

 Syntactic complexity is measured by Coh-Metrix 2.0 in several ways. First, there is the mean number of modifiers per noun-phrase. A modifier is an optional element that describes the property of a head of a phrase. Modifiers per NP refer to adjectives, adverbs, or determiners that modify the head noun. For example, the noun-phrase the lovely, little girl has three modifiers: the, lovely and little. A second metric is mean number of higher level constituents per sentence, controlling for number of words.  Sentences with difficult syntactic composition are structurally embedded and have a higher incidence of verb-phrases after controlling for number of words.  A third metric is the number of words that appear before the main verb of the main clause in the sentences of a text.  Sentences that have many words before the main verb are taxing on working memory. 

 

4.1 Constituents

 

 

Noun phrase incidence: DENSNP (index 49)   

 

This is the incidence of noun-phrase constituents per 1000 words.

 

Example:

 

Cell division occurs to reproduce and replace cells.

 

In this example, there are two main NPs: cell division and cells. There are eight words, so the incidence score for this sentence is 250 (2 / 8 x 1000).

 

Modifiers per NP: SYNNP (index 50)

 

This is the mean number of modifiers per noun-phrase.  

 

 

Higher level constituents: SYNHw (index 51)

 

This is the mean number of higher level constituents per word. Structurally dense sentences tend to have more higher-order syntactic constituents per word.

 

 

Words before main verb: SYNLE (index 52)

 

This is the mean number of words before the main verb of the main clause in sentences. This is a good index of working memory load. 

 

 

Negations: DENNEGi (index 48)

 

This is the incidence score for negation expressions.

 

 

 


 

 

4.2 Pronouns, Types, and Tokens

 

Personal pronoun: DENPRPi (index 17)

 

This is the number of personal pronouns per 1000 words. A high density of pronouns can create referential cohesion problems if the reader does not know what the pronouns refer to.

 

Example:

 

Paul told John that he wanted to help him out.

 

The words he and him in this sentence are both pronouns, leading to a density score of 200. The pronouns, however, are ambiguous as we do not know which pronoun refers to which person.

 

 

Pronoun ratio: DENSPR2 (index 18)

 

This is the ratio of words classified as pronouns to the incidence of noun-phrases in a text. A high density of pronouns compared with the density of noun-phrases creates referential cohesion problems when the reader does not know what the pronouns refer to.

 

Example:

 

The fourth stage of mitosis is called telophase, because telo- means "end", and it begins when all the daughter chromosomes reach the two cell poles.

 

The word it is tagged as a pronoun, whereas phrases such as the fourth stage are tagged as noun phrases. If there is one pronoun and 8 total noun-phrases (the pronoun itself being a noun phrase) then the ratio would be 0.125.

 

 

Type-token ratio: TYPTOKc (index 19)

 

Type-token ratio (TTR) (Templin, 1957) is the number of unique words (called types) divided by the number of tokens of these words. Each unique word in a text is considered a word type. Each instance of a particular word is a token. For example, if the word dog appears in the text 7 times, its type value is 1, whereas its token value is 7. When the type-token ratio approaches 1, each word occurs only once in the text; comprehension should be comparatively difficult because many unique words need to be decoded and integrated with the discourse context. As the type-token ratio decreases, words are repeated many times in the text, which should increase the ease and speed of text processing. Type-token ratios are computed for content words, but not function words. TTR scores are most valuable when texts of similar lengths are compared.

 

Example:

 

Cytokinesis, the second stage of cell division, begins to occur before mitosis is complete (usually during telophase) and continues after the nuclei of the daughter cells are completely formed. The preliminary steps of cytokinesis occur during the growth interphases (called the G phases) of the cell cycle.

 

In these sentences (taken from the text reprinted later in the help facility), the TTR for content words is 0.933. Words such as stage only occur once, but words like cytokinesis and cell appear more than once. Coh-Metrix uses lexeme versions in its calculation rather than lemma or stem versions; for example, cell is considered different from cells.
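The type-token computation can be sketched as follows. The function name and the token list are hypothetical; identifying content words is assumed to have been done upstream, and exact word forms are compared, so cell and cells count as different types.

```python
def type_token_ratio(tokens):
    """Number of unique word forms (types) divided by the number of tokens."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(set(tokens)) / len(tokens)

# Hypothetical content-word tokens: dog appears 7 times, three other words once each.
tokens = ["dog"] * 7 + ["cat", "bird", "fish"]
print(type_token_ratio(tokens))             # 0.4  (4 types / 10 tokens)
print(type_token_ratio(["cell", "cells"]))  # 1.0  (two distinct word forms)
```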

 

 

 

4.4 Connectives

 

Many strands of cohesion are potentially important and recognized in linguistics, discourse processing, psychology, education, rhetoric and other fields that analyze text. Connectives are one important class of signaling devices for particular categories of cohesion relations in text (Halliday & Hasan, 1976). In dialogue, several classes of discourse markers help connect the thread of conversation (Louwerse & Mitchell, 2003).  The insertion of connectives is known to have a substantial impact on comprehension and memory for text (McNamara, Kintsch, E., Songer, & Kintsch, W., 1996).  

Connectives are classified on two dimensions in Coh-Metrix 2.0. On one dimension, the extension of the situation described by the text is determined. Positive connectives extend events, whereas negative connectives cease to extend the expected events (Louwerse, 2002; Sanders, Spooren & Noordman, 1992). Negative relations are synonymous with adversative relations, as defined in Halliday and Hasan (1976).

 

Positive (p): and, after, because

 

Negative (n): but, until, although

 

On another dimension, there are connectives associated with the type of cohesion, namely additive, temporal, logical, and causal. Examples are given below.

 

Additives (AD): also, moreover, however, but

 

Causal (CS): because, so, consequently, although, nevertheless

 

Logical (LG): or, actually, if

 

Temporal (TP): after, before, when, until

 

 

All connectives: CONi (index 29)

 

This is the incidence of all connectives.

 

 

Conditional operators: DENCONDi (index 30)

 

This is the incidence score for conditional expressions.

 

 

Positive additive connectives: CONADpi (index 31)

 

This is the incidence of positive additive connectives.

 

 

Positive temporal connectives: CONTPpi (index 32)

 

This is the incidence of positive temporal connectives.

 

 

Positive causal connectives: CONCSpi (index 33)

 

This is the incidence of positive causal connectives.

 

 

Positive logical connectives: CONLGpi (index 34)

 

This is the incidence of positive logical connectives.

 

 

Negative additive connectives: CONADni (index 35)

 

This is the incidence of negative additive connectives.

 

 

Negative temporal connectives: CONTPni (index 36)

 

This is the incidence of negative temporal connectives.

 

 

Negative causal connectives: CONCSni (index 37)

 

This is the incidence of negative causal connectives.

 

 

Negative logical connectives: CONLGni (index 38)

 

This is the incidence of negative logical connectives.

 


 


 

4.5  Logical Operators

 

Logical operators are prevalent in syllogisms and texts that express logical reasoning. They include the Boolean operators (and, or, not, if, then) and a small number of other similar cognate terms. Texts with a high density of these logical operators are difficult for most readers.

 

 

Logical operators: DENLOGi (index 39)

 

This is the incidence of logical operators. Along with "and", "or", and negations, a number of conditionals are also included.

 

4.6 Sentence syntax similarity

 

The sentence syntax similarity indices compare the syntactic tree structures of sentences.  The algorithms build an intersection tree between two syntactic trees, one for each of the two sentences being compared. An index of syntactic similarity between two sentences is the proportion of nodes in the two tree structures that are intersecting nodes.

 

Syntactic structure similarity adjacent: STRUTa (index 24)

This is the proportion of intersection tree nodes between all adjacent sentences.

Syntactic structure similarity all 1: STRUTt (index 25)

This is the proportion of intersection tree nodes between all sentences and across paragraphs.

Syntactic structure similarity all 2: STRUTp (index 26)

This is the proportion of intersection tree nodes between all sentences, but within paragraphs.

 

5. Referential and Semantic Indices

 

Referential cohesion occurs when a noun, pronoun, or noun phrase refers to another constituent in the text. For example, consider the sentence When water is heated, it boils. The word it refers to the word water. A referring expression (N) is the noun, pronoun, or noun-phrase that refers to another constituent (C). C is designated as the referent of N. In the example sentence above, the word it is the referring expression N, whereas the referent C is the word water. In most cases, the referent C occurs in the text prior to the referring expression N; this is known as anaphora. However, the referring expression can also precede the referent constituent; this is known as cataphora. An example of cataphora would be When it is heated, water boils.

Referential cohesion has been extensively investigated in the fields of text linguistics and discourse processes, especially in the form of argument overlap (Kintsch & Van Dijk, 1978). Argument overlap occurs when a noun, pronoun, or noun-phrase in one sentence is a coreferent of a noun, pronoun, or noun-phrase in another sentence. The word argument is used in a special sense in this context, namely in contrast to predicates in propositional representations (see Kintsch & Van Dijk, 1978). In this early work, two sentences were regarded as being linked by coreference if they shared a common argument (i.e., an overlapping noun, pronoun, or noun-phrase). However, the early theory was eventually expanded to allow referential overlap between a {noun | pronoun | noun-phrase} N and a referential proposition that has a similar morphological stem as a noun in N. For example, consider the two sentences When water is heated, it boils and eventually evaporates. When the heat is reduced, it turns back into a liquid form. The heat in the second sentence corefers to the proposition Water is heated; heat and heated share the same morphological stem heat, even though one is a noun and the other is a verb.

In addition to referential indices, there are indices that assess the extent to which the content of sentences or paragraphs is similar semantically or conceptually. Coherence is predicted to increase as a function of similarity. One index of semantic similarity is content word overlap, which is the proportion of content words that two excerpts have in common. Another method of computing similarity is through Latent Semantic Analysis (LSA). LSA is a mathematical, statistical technique for representing world knowledge, based on a large corpus of texts. LSA uses singular value decomposition, a general form of principal component analysis, to condense a very large corpus of texts to 100-500 dimensions (Deerwester, Dumais, Furnas, Landauer & Harshman, 1990; Landauer & Dumais, 1997; Landauer, Foltz, & Laham, 1998). The conceptual similarity between any two text excerpts (e.g., word, clause, sentence, text) is evaluated by these 100-500 functional dimensions. There are many statistical metrics for computing similarity other than LSA (see Landauer, McNamara, Dennis, & Kintsch, in press; Millis, Kim, Todaro, Magliano, Wiemer-Hastings, & McNamara, 2004).
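The text above does not spell out how two excerpts are compared once they are represented in the LSA space. In the LSA literature the comparison is conventionally the cosine between the two vectors, and the sketch below assumes that convention. The function name and the 3-dimensional toy vectors are hypothetical; real LSA vectors have 100-500 dimensions and come from a trained space.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is treated as unrelated
    return dot / (norm_u * norm_v)

# Toy vectors standing in for two sentences in a (hypothetical) LSA space.
sentence_a = [0.2, 0.7, 0.1]
sentence_b = [0.4, 1.4, 0.2]   # same direction as sentence_a
sentence_c = [0.7, -0.2, 0.0]  # orthogonal to sentence_a
print(round(cosine(sentence_a, sentence_b), 6))  # 1.0
print(round(cosine(sentence_a, sentence_c), 6))  # 0.0
```

Under this convention, an index such as the mean LSA score for adjacent sentences would be the average of such cosines over the relevant sentence pairs.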

 

5.1 Anaphor

 

Adjacent anaphor reference: CREFP1u (index 7)

 

This is the proportion of anaphor references between adjacent sentences.

 

Example:

There are four distinct phases of mitosis called prophase, metaphase, anaphase, and telophase. These four phases are well known to researchers who can easily observe them with, for example, the simple light microscope.

 

In this example, the pronoun them refers to phases in the previous sentence.

 

Anaphor reference: CREFPau (index 8)

 

This is the proportion of unweighted anaphor references that refer back to a constituent up to 5 sentences earlier.

 

5.2 Coreference

 

Coh-Metrix currently considers three forms of coreference between sentences. For any two sentences s1 and s2, if there exists a noun common to both, then the two sentences are considered noun overlapped. If there exist two nouns, one from s1 and the other from s2, that share a common stem, then the two sentences are considered noun stem overlapped. If a noun from one sentence, s1, has a stem that is shared by a word of any category in the other sentence, s2, then the two sentences are considered stem overlapped.
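
The three forms of coreference can be sketched as three set-intersection tests. This is only an illustration, not the Coh-Metrix implementation: the stem table, the function names, and the hand-labeled sentence dictionaries are all hypothetical, and the stemmer and part-of-speech tagger that Coh-Metrix actually uses are not described in this section.

```python
# Hypothetical hand-made stem table (an assumption for illustration only).
STEMS = {"division": "divid", "divide": "divid", "cells": "cell", "cell": "cell"}

def stem(word):
    """Look up a word's stem, falling back to the lowercased word itself."""
    word = word.lower()
    return STEMS.get(word, word)

def noun_overlap(s1, s2):
    """True if the two sentences share an identical noun."""
    return bool(set(s1["nouns"]) & set(s2["nouns"]))

def noun_stem_overlap(s1, s2):
    """True if a noun in s1 and a noun in s2 share a stem."""
    return bool({stem(n) for n in s1["nouns"]} & {stem(n) for n in s2["nouns"]})

def stem_overlap(s1, s2):
    """True if a noun in s1 shares a stem with a word of any category in s2."""
    return bool({stem(n) for n in s1["nouns"]} & {stem(w) for w in s2["words"]})

# Hand-labeled toy sentences: "nouns" lists the nouns, "words" lists all content words.
s1 = {"nouns": ["division", "cells", "nucleus"],
      "words": ["division", "cells", "nucleus", "involves"]}
s2 = {"nouns": ["mitosis", "cell", "cytoplasm"],
      "words": ["mitosis", "cell", "cytoplasm", "divide"]}
s3 = {"nouns": ["liquid"],
      "words": ["liquid", "divide"]}
```

With these toy inputs, s1 and s2 are not noun overlapped (no identical noun) but are noun stem overlapped through "cells" and "cell", while s1 and s3 are only stem overlapped, through the noun "division" and the verb "divide".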

 

 

Adjacent argument overlap: CREFA1u (index 9)

This is the proportion of adjacent sentences that share one or more arguments (i.e., noun, pronoun, noun-phrase).  

Example:

Cell division occurs to reproduce and replace cells. The division of cells with a membrane-bound nucleus and organelles (eucaryotic cells) involves two distinct but overlapping stages, mitosis and cytokinesis.

 

In this example, the word cells overlaps between two adjacent sentences. This excerpt is part of the sample text printed at the bottom of this help facility.

 

Argument overlap: CREFAau (index 10)

This is the proportion of all sentence pairs in a paragraph that share one or more arguments (i.e., noun, pronoun, noun-phrase).
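
The difference between the adjacent and all-pairs versions of argument overlap can be sketched as follows. The sketch assumes each sentence has already been reduced to a set of arguments (here hand-labeled toy sets); the function names are hypothetical and the extraction of arguments from raw text is not shown.

```python
from itertools import combinations

def shares_argument(a, b):
    """True if two sentences (as sets of arguments) have any argument in common."""
    return bool(set(a) & set(b))

def proportion_overlapping(pairs):
    """Proportion of sentence pairs that share at least one argument."""
    pairs = list(pairs)
    if not pairs:
        return 0.0  # assumption: texts with fewer than two sentences score 0
    return sum(shares_argument(a, b) for a, b in pairs) / len(pairs)

def adjacent_argument_overlap(sentences):
    """Only pairs of neighboring sentences are compared."""
    return proportion_overlapping(zip(sentences, sentences[1:]))

def all_pairs_argument_overlap(sentences):
    """Every pair of sentences in the paragraph is compared."""
    return proportion_overlapping(combinations(sentences, 2))

# Toy paragraph of four sentences, each given as a set of arguments.
paragraph = [
    {"cells", "division"},
    {"division", "cells", "mitosis"},
    {"mitosis", "nucleus"},
    {"cytoplasm"},
]
```

For this toy paragraph, two of the three adjacent pairs overlap, but only two of the six possible pairs do, so the all-pairs score is lower than the adjacent score. The number of pairs grows quadratically with the number of sentences, which is why the all-pairs measures become expensive for long texts.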

 

Adjacent stem overlap: CREFS1u (index 11)

This is the proportion of adjacent sentences that share one or more word stems.

Example:

The division of cells with a membrane-bound nucleus and organelles (eucaryotic cells) involves two distinct but overlapping stages, mitosis and cytokinesis. Mitosis occurs to replicate the cell's genetic material in the nucleus, whereas cytokinesis occurs to divide the gel-like liquid surrounding the cell's nucleus, called cytoplasm.

 

In this example, taken from the sample text printed at the bottom of this help facility, the word division has a stem overlap with divide.

 

Stem overlap: CREFSau (index 12)

This is the proportion of all sentence pairs in a paragraph that share one or more word stems. When the text is extremely long, it is not possible to compute this measure because the computation requires a large amount of processing time.

 

Content word overlap: CREFC1u (index 13)

This is the proportion of content words in adjacent sentences that are common to both sentences.

 

 


 

5.3 Latent Semantic Analysis (LSA)

 

Latent Semantic Analysis (LSA) is a mathematical, statistical technique for representing world knowledge, based on a large corpus of texts. LSA uses singular value decomposition, a general form of principal component analysis, to condense a very large corpus of texts to 100-500 dimensions (Deerwester, Dumais, Furnas, Landauer & Harshman, 1990; Landauer & Dumais, 1997; Landauer, Foltz, & Laham, 1998). The conceptual similarity between any two text excerpts (e.g., word, clause, sentence, text) is evaluated by these 100-500 functional dimensions. In Coh-Metrix, therefore, the sentences, paragraphs and entire texts are represented by LSA vectors of 100-500 dimensions. The cosine of the angle between two vectors is used to measure the similarity between excerpts. Text cohesion is assumed to increase as a function of higher cosine scores between text constituents.

There are several methods of computing LSA cohesion, as specified below. Both the means and standard deviations may be computed when several pairs of cosine similarity scores are part of the computation. Coh-Metrix 2.0 reports means but not standard deviations.

 

 

LSA sentence adjacent: LSAassa (index 14)

 

This index computes mean LSA cosines for adjacent, sentence-to-sentence (abbreviated as "ass") units. This measures how conceptually similar each sentence is to the next sentence.

 

Example:

 

Text 1: The field was full of lush, green grass. The horses grazed peacefully. The young children played with kites. The women occasionally looked up, but only occasionally. A warm summer breeze blew and everyone, for once, was almost happy.

 

Text 2: The field was full of lush, green grass. An elephant is a large animal. No-one appreciates being lied to. What are we going to have for dinner tonight?

 

In the example texts printed above, Text 1 records much higher LSA scores than Text 2. The words in Text 1 tend to be thematically related to a pleasant day in an idyllic park scene: green, grass, children, playing, summer, breeze, kites, and happy. In contrast, the sentences in Text 2 tend to be unrelated.

 

 

LSA sentence all: LSApssa (index 15)

 

Like LSA sentence adjacent (LSAassa), this index computes mean LSA cosines.  However, for this index all sentence combinations are considered, not just adjacent sentences. LSApssa computes how conceptually similar each sentence is to every other sentence in the text.
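
The two sentence-level LSA indices differ only in which sentence pairs are averaged. The sketch below assumes each sentence has already been mapped to an LSA vector; the tiny two-dimensional vectors and the function names are hypothetical stand-ins for the 100-500 dimensional vectors described above, and building the LSA space itself is not shown.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # assumption: zero vectors score 0

def mean_cosine(pairs):
    """Mean cosine over a collection of vector pairs."""
    pairs = list(pairs)
    if not pairs:
        return 0.0  # assumption: fewer than two sentences scores 0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def lsa_sentence_adjacent(vectors):
    """Mean cosine between each sentence and the next one."""
    return mean_cosine(zip(vectors, vectors[1:]))

def lsa_sentence_all(vectors):
    """Mean cosine over all sentence combinations."""
    return mean_cosine(combinations(vectors, 2))

# Toy sentence vectors: the first two sentences are identical in meaning,
# the third is unrelated to both.
sentence_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
```

Here the adjacent cosines are 1 and 0, while the three possible pairs have cosines 1, 0, and 0, so the all-pairs mean is lower than the adjacent mean.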

 

 

LSA paragraph: LSAppa (index 16)

 

This computes LSA cosines for paragraph-to-paragraph (pp) units. This measures how similar paragraphs are to the other paragraphs in the text. This measure cannot be computed for texts with only one paragraph or for texts that ignore paragraph junctures.

 

 


 

6. Situation model dimensions

 

Many aspects of a text can contribute to the situation model (or mental model), the referential content or microworld of what a text is about (Graesser, Millis, & Zwaan, 1997; Kintsch, 1998; van Dijk & Kintsch, 1983). Text comprehension researchers have investigated at least five situational dimensions (Zwaan & Radvansky, 1998): causation, intentionality, time, space and protagonists. All of these situational dimensions can be indicated in a text by connectives, particles, nouns and verbs. In Coh-Metrix 2.0, the protagonist dimension is not analyzed.

 

6.1 Causal dimension

 

Causal cohesion reflects the extent to which sentences are related by causal cohesion relations. Causal cohesion relations are appropriate when the text refers to events and actions that are related causally, as in the case of science texts with causal mechanisms and stories with an action plot. Causality is not relevant, for example, in texts that describe static scenes and texts that convey abstract logical arguments.

Coh-Metrix 2.0 first needs to estimate how much of the text refers to events and actions that may be part of causal content. This is accomplished by counting the number of main verbs that are causal, based on WordNet (Fellbaum, 1998; Miller, Beckwith, Fellbaum, Gross, & Miller, 1990). WordNet is a lexical database that contains a large number of semantic characteristics of words. A verb is classified as "causal" if it belongs to certain WordNet categories. The higher the incidence of causal verbs in a text, the more the text is assumed to convey causal content.

Having causal verbs in a text does not ensure that the reader can connect these events and actions with causal relations. In Coh-Metrix 2.0, causal cohesion relations are signaled by causal particles. Some causal particles are conjunctions, transitional adverbs, and other forms of connectives, such as since, so that, because, and consequently. These particles are used to indicate some causal relationship between clauses that refer to events and actions. Other causal particles consist of a small number of verbs that explicitly assert there is a causal relationship between constituents, without specifying the nature of the causal content: cause, enable, make. It should be noted that the words "and" and "or" are not classified as causal particles.

 

Causal content: CAUSVP (index 20)

 

This is the incidence of causal verbs and causal particles in text.

 

 

Causal cohesion: CAUSC (index 21)

 

This is a ratio of causal particles (P) to causal verbs (V). The denominator is incremented by the value of 1 to handle the rare case when there are 0 causal verbs in the text. Cohesion suffers when the text has many causal verbs (signifying events and actions) but few causal particles that signal how the events and actions are connected.
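
A minimal sketch of this ratio, with the denominator incremented by 1 as described. The function name and the example counts are hypothetical; identifying causal verbs and particles in a text is not shown.

```python
def causal_cohesion(particles, verbs):
    """Ratio of causal particles (P) to causal verbs (V), computed as P / (V + 1).

    Adding 1 to the denominator keeps the ratio defined when V is 0.
    """
    return particles / (verbs + 1)

# Hypothetical counts: 3 causal particles and 5 causal verbs.
example_score = causal_cohesion(3, 5)

# A text with causal particles but no causal verbs still gets a finite score.
no_verbs_score = causal_cohesion(2, 0)
```

A text with many causal verbs and few particles yields a small ratio, which matches the claim that cohesion suffers in that case.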

 


 

6.2 Intentional dimension

 

Intentional cohesion reflects the extent to which sentences are related by intentional cohesion relations. Intentional cohesion relations are appropriate when the text refers to animate protagonists who perform actions in pursuit of goals, as in the case of simple stories and other forms of narrative (Singer & Halldorson, 1996; Van den Broek & Trabasso, 1986). Intentionality is not relevant, for example, in texts that describe events that are not goal directed and not executed by animate agents (e.g., mechanisms that cause volcanoes).

In Coh-Metrix 2.0, the incidence of intentional actions and events is estimated by counting the number of main verbs that are intentional, based on WordNet (Fellbaum, 1998; Miller, Beckwith, Fellbaum, Gross, & Miller, 1990), and that are performed by animate subject nouns (according to WordNet).  WordNet is a lexical database that contains a large number of semantic characteristics of words. A verb is classified as "intentional" if it belongs to particular WordNet categories. The higher the incidence of intentional actions in a text, the more the text is assumed to convey goal-driven content.

Intentional cohesion is the ratio of intentional particles (e.g., in order to, so that, for the purpose of, by means of, by, wanted to) to the incidence of intentional actions and events.

 

Intentional content: INTEi (index 22)

 

This is the incidence of intentional actions, events, and particles (per thousand words).
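
A minimal sketch of the per-thousand-words normalization mentioned above. The function name and the example counts are hypothetical, and how Coh-Metrix tallies intentional actions, events, and particles is not specified here.

```python
def incidence_per_thousand(count, total_words):
    """Number of occurrences per 1,000 words of text."""
    if total_words <= 0:
        return 0.0  # assumption: an empty text scores 0
    return 1000.0 * count / total_words

# Hypothetical text: 12 intentional actions, events, and particles in 400 words.
example_incidence = incidence_per_thousand(12, 400)
```

Normalizing by text length makes scores comparable across texts of different sizes: 12 occurrences in 400 words and 30 occurrences in 1,000 words give the same incidence.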

 

 

Intentional cohesion: INTEC (index 23)

 

This is the ratio of intentional particles to intentional actions/events. 

 

 

6.3 Temporal dimension

 

Temporal cohesion reflects the extent to which sentences are related by temporal cohesion relations. Temporal cohesion relations are appropriate when the text refers to events or actions. The actions and events may be articulated in different tenses (past, present, future) and different aspects (e.g., in progress, completed, or static). For example, "X died" is in the past tense and completed, "X is dying" is present and in progress, "X is dead" is static, whereas "X will have died" is future and completed. Temporal cohesion is measured by the repetition scores when analyzing the sequence of verbs that are classified in a particular tense and aspect.

 

 

Temporal cohesion: TEMPta (index 27)

 

This is the repetition score for tense and aspect.  The repetition score for tense is averaged with the repetition score for aspect. 

 

 

6.4 Spatial dimension

 

Spatial cohesion reflects the extent to which the text has spatial content and the sentences are related by spatial particles or relations. Spatial content includes location nouns and prepositions, as well as motion actions and prepositions, based on WordNet (Fellbaum, 1998; Miller, Beckwith, Fellbaum, Gross, & Miller, 1990). WordNet is a lexical database that contains a large number of semantic characteristics of words. Location nouns are defined by WordNet as locational, such as Memphis, place, Central Park. Location prepositions (in, by, near), deictic references (here, there), and other particles play a role in relating the nouns in space. Motion verbs (go, run, drive) are related by particles that refer to spatial and deictic indexes (from, to, through, by, between, here, there).

In Coh-Metrix 2.0, the location ratio score is the incidence of location prepositions (LSP) divided by LSP plus the incidence of location nouns. The motion ratio score is the incidence of motion particles (MSP) divided by MSP plus the incidence of motion verbs. Spatial cohesion is the average of these two ratios.

 

Spatial cohesion: SPATC (index 28)

 

This is the mean of the location and motion ratio scores. The location ratio score is the incidence of location prepositions (LSP) divided by LSP plus the incidence of location nouns. The motion ratio score is the incidence of motion prepositions (MSP) divided by MSP plus the incidence of motion verbs.
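
The spatial cohesion score can be sketched as follows, assuming both denominators are sums (particle incidence plus the incidence of the corresponding nouns or verbs). The function names, the zero-denominator handling, and the example incidences are hypothetical.

```python
def ratio_score(particle_incidence, content_incidence):
    """Particle incidence divided by (particle incidence + content incidence)."""
    total = particle_incidence + content_incidence
    return particle_incidence / total if total else 0.0  # assumption: 0 when both are 0

def spatial_cohesion(lsp, location_nouns, msp, motion_verbs):
    """Mean of the location ratio score and the motion ratio score."""
    location_ratio = ratio_score(lsp, location_nouns)
    motion_ratio = ratio_score(msp, motion_verbs)
    return (location_ratio + motion_ratio) / 2

# Hypothetical incidences: LSP = 20, location nouns = 30, MSP = 10, motion verbs = 10.
# Location ratio = 20 / 50 = 0.4, motion ratio = 10 / 20 = 0.5.
example_spatc = spatial_cohesion(20, 30, 10, 10)
```

Because each ratio has the particle incidence in both numerator and denominator, each lies between 0 and 1, and so does their mean.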


IV. References

Herdan, G. (1960). Type-token mathematics: A textbook of mathematical linguistics. 's-Gravenhage: Mouton.

Baayen, R. H., Piepenbrock, R., & van Rijn, H. (Eds.). (1993). The CELEX Lexical Database (CD-ROM). University of Pennsylvania, Philadelphia (PA): Linguistic Data Consortium.

Brill, E. (1992). A simple rule-based part of speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing, ACL.

Coltheart, M. (1981). The MRC psycholinguistic database. Quarterly Journal of Experimental Psychology, 33A, 497-505.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41, 391-407.

Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.

Graesser, A. C., McNamara, D. S., & Louwerse, M. M. (2003). What do readers need to learn in order to process coherence relations in narrative and expository text? In A. P. Sweet and C. E. Snow (Eds.), Rethinking reading comprehension (pp. 82-98). New York: Guilford Publications.

Graesser, A. C., McNamara, D. S., Louwerse, M. M., & Cai, Z. (2004). Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments, & Computers, 36, 193-202.

Graesser, A. C., Millis, K. K., & Zwaan, R. A. (1997). Discourse comprehension. In J. T. Spence, J. M. Darley, and D. J. Foss (Eds.), Annual Review of Psychology, Vol. 48. Palo Alto, CA: Annual Reviews Inc.

Haberlandt, K., & Graesser, A. C. (1985).   Component processes in text comprehension and some of their interactions.    Journal of Experimental Psychology: General, 114, 357-374.

Halliday, M. A. K., & Hasan, R. (1976). Cohesion in English. London: Longman.

Just, M. A. & Carpenter, P. A. (1980). A theory of reading: From eye fixations to comprehension. Psychological Review, 87, 329-354.   

Kintsch, W. (1998). Comprehension: A paradigm for cognition. Cambridge: Cambridge University Press.   

Kintsch, W., & Van Dijk, T. A. (1978). Toward a model of text comprehension and production. Psychological Review, 85, 363-394.

Klare, G. R. (1974-1975). Assessing readability. Reading Research Quarterly, 10, 62-102.

Knott, A. & Dale, R. (1994). Using linguistic phenomena to motivate a set of rhetorical relations. Discourse Processes, 18, 35-62.   

Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240.

Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25, 259-284.

Landauer, T., McNamara, D., Dennis, S., & Kintsch, W. (Eds.). (in press). LSA: A road to meaning. Mahwah, NJ: Erlbaum.

Louwerse, M. M. (2002). An analytic and cognitive parameterization of coherence relations. Cognitive Linguistics, 291-15.

Louwerse, M.M., & Mitchell, H.H. (2003).  Toward a taxonomy of a set of discourse markers in dialog: A theoretical and computational linguistic account.  Discourse Processes, 35, 199-239.

Martin, J. R. (1992). English text: System and structure. Amsterdam: Benjamins.

McNamara, D.S., Kintsch, E., Songer, N.B., & Kintsch, W. (1996). Are good texts always better? Text coherence, background knowledge, and levels of understanding in learning from text. Cognition and Instruction, 14, 1-43.   

McNamara, D. S., Louwerse, M. M., & Graesser, A. C. (2002). Coh-Metrix: Automated cohesion and coherence scores to predict text readability and facilitate comprehension. Institute for Intelligent Systems, University of Memphis, Memphis, TN.

Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. (1990). Five papers on WordNet. Cognitive Science Laboratory, Princeton University, No. 43.

Millis, K. K., Kim, H. J., Todaro, S., Magliano, J., Wiemer-Hastings, K., & McNamara, D. S. (2004). Identifying reading strategies using latent semantic analysis: Comparing semantic benchmarks. Behavior Research Methods, Instruments, & Computers, 36, 213-221.

Myers, J. L., & O'Brien, E. J. (1998). Accessing the discourse representation during reading. Discourse Processes, 26, 131-157.

Sanders, T. J. M., Spooren, W. P. M., & Noordman, L. G. M. (1992). Toward a taxonomy of coherence relations. Discourse Processes, 15, 1-35.

Sanders, T. J. M., Spooren, W. P. M., & Noordman, L. G. M. (1993). Coherence relations in a cognitive theory of discourse representation. Cognitive Linguistics, 4, 93-133.

Sekine, S., & Grishman, R. (1995). A corpus-based probabilistic grammar with only two non-terminals. In the Fourth International Workshop on Parsing Technology. Prague, Czech Republic.

Singer, M., & Halldorson, M. (1996).  Constructing and validating motive bridging inferences.  Cognitive Psychology, 30, 1-38.    

Templin, M. C. (1957). Certain language skills in children, their development and interrelationships. Minneapolis, MN: University of Minnesota Press.

Van den Broek, P., & Trabasso, T. (1986). Causal networks versus goal hierarchies in summarizing texts. Discourse Processes, 9, 1-15.

Van Dijk, T. A., & Kintsch, W. (1983). Strategies of discourse comprehension. New York: Academic Press.

Zwaan, R. A., & Radvansky, G. A. (1998). Situation models in language comprehension and memory. Psychological Bulletin, 123, 162-185.

Zwaan, R. A., Langston, M. C., & Graesser, A. C. (1995). The construction of situation models in narrative comprehension: An event-indexing model. Psychological Science, 6, 292-297.

 

V. Example Text and Output

 

Example Text           

Coh-Metrix Output for Example Text