Coh-Metrix version 2.0 indices
Table of contents
III. Indices in the Coh-Metrix 2.0 output file
1. General identification and reference information
3. General word information and text information
5. Referential and semantic indices
IV. References
Coh-Metrix is a computational tool that produces indices of
the linguistic and discourse representations of a text. These values can be used
in many different ways to investigate the cohesion of the explicit text and the
coherence of the mental representation of the text. Our definition of cohesion consists
of characteristics of the explicit text that play some role in helping the
reader mentally connect ideas in the text (Graesser, McNamara, & Louwerse,
2003). The definition of coherence is the subject of much debate. Theoretically,
the coherence of a text is defined by the interaction between linguistic
representations and knowledge representations. When we put the spotlight on the
text, however, coherence can be defined as characteristics of the text (i.e.,
aspects of cohesion) that are likely to contribute to the coherence of the
mental representation. Coh-Metrix provides indices of such cohesion
characteristics.
1. Preliminary information
This document contains a
description of 60 indices incorporated into the website Coh-Metrix version 2.0.
These descriptions are intended to be succinct specifications for people who
want to work on this version of Coh-Metrix. More theoretical information on the
Coh-Metrix indices and architecture is reported in Graesser, McNamara, Louwerse
and Cai (2004) and McNamara, Louwerse and Graesser (2002). For each level of
indices, example scores are provided. However, it is important to note that even
small changes in texts can lead to large changes in Coh-Metrix output. It is
also important to note that the scores are often subject to the output of third
party parsers, lexicons, and word frequency databases, all of which are outside
of the control of Coh-Metrix.
2. Coh-Metrix concepts
Some definitions of key concepts are needed to specify the algorithms that
underlie the indices in Coh-Metrix 2.0. We define these concepts in this
section.
Adjacent versus All sentences
Adjacent sentences are successive sentences in a span of text. For example, if a
span of text has 4 sentences, then the adjacent sentences would be sentences
1-2, 2-3, and 3-4. In contrast, all sentences are all possible pairs of
sentences: 1-2, 2-3, 3-4, 1-3, 1-4, and 2-4. A span of text may be defined in
different ways for different purposes, but there is a distinction between
paragraph spans and the span of an entire document. The adjacent sentences in
Coh-Metrix 2.0 ignore junctures between paragraphs.
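The two pairings can be illustrated with a minimal sketch. The function names adjacent_pairs and all_pairs are hypothetical and are not part of Coh-Metrix; sentences are simply numbered from 1.

```python
from itertools import combinations


def adjacent_pairs(n_sentences):
    """Return successive sentence pairs: (1, 2), (2, 3), ..."""
    return [(i, i + 1) for i in range(1, n_sentences)]


def all_pairs(n_sentences):
    """Return every possible pair of distinct sentences."""
    return list(combinations(range(1, n_sentences + 1), 2))


print(adjacent_pairs(4))
print(all_pairs(4))
```

For a span of 4 sentences this yields the three adjacent pairs and the six "all" pairs listed above.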
Weighted versus Unweighted distances between sentences
This distinction is pervasive in a more advanced version of
Coh-Metrix (1.2), but not in most indices used in Coh-Metrix 2.0. When the
distance between sentences is weighted, the weight between two sentences
decreases the farther apart they are in the text. All sentence pairs have equal
weight when distances are unweighted. With rare exception, distances between
sentences are unweighted in Coh-Metrix 2.0.
Incidence scores versus Ratio scores
An incidence score is the number of classified units per 1000 words. For
example, the incidence score for pronouns is the number of words that are
classified as pronouns per 1000 words of text. It is equivalent to what
some researchers call rates or density scores. In contrast, a ratio score is a
relative measure that compares the incidence of one class of units to the
incidence of another class of units. For example, a pronoun ratio is the
incidence of pronouns divided by the incidence of noun-phrases. Ratio scores
compare two different metrics (classes of units), whereas an incidence score
applies to only one metric.
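As a minimal sketch of these two definitions, assuming the unit counts and word counts are already available (the function names incidence and ratio are illustrative, not Coh-Metrix identifiers):

```python
def incidence(unit_count, word_count):
    """Number of classified units per 1000 words."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return 1000.0 * unit_count / word_count


def ratio(incidence_a, incidence_b):
    """Incidence of one class of units relative to another class."""
    if incidence_b == 0:
        raise ValueError("the comparison class has zero incidence")
    return incidence_a / incidence_b


# Two pronouns in a ten-word sentence
print(incidence(2, 10))
```

Because both incidences in a ratio are scaled by the same word count, the ratio reduces to the raw count of one class divided by the raw count of the other.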
Repetition score
A repetition score is computed on sequences of text units
that are classified into categories. This score is the proportion of adjacent
pairs of units in the sequence that are in the same category. If there are N
units in a sequence, there are (N-1) adjacent pairs. The number of
adjacent pairs in the same category is divided by N-1. For example, we have
computed the repetition score for a sequence of categories A, B, and C:
Category sequence: A B B B C A A C C B B B B A C C
Adjacency repetition: 0 1 1 0 0 1 0 1 0 1 1 1 0 0 1
The repetition score for this sequence is 8 / 15.
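The computation can be sketched as follows. The function name repetition_score is hypothetical; the sketch simply counts adjacent pairs in the same category and divides by N-1.

```python
def repetition_score(categories):
    """Proportion of adjacent pairs of units that share a category."""
    pairs = list(zip(categories, categories[1:]))
    if not pairs:
        raise ValueError("at least two units are required")
    matches = sum(1 for left, right in pairs if left == right)
    return matches / len(pairs)


sequence = "A B B B C A A C C B B B B A C C".split()
print(repetition_score(sequence))
```

Applied to the 16-unit sequence above, 8 of the 15 adjacent pairs match.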
II. Overview of Coh-Metrix 2.0 Indices
The table of indices in the Coh-Metrix 2.0 output file has
four columns. The columns are: (1) the ordinal number of the index, (2) a short
description or label for the index, (3) a coded index that has a maximum of 8
characters and that assists in reconstructing relevant distinctions and
parameters in some software packages (e.g., SPSS), and (4) an expanded full
description of the index. Scores for these indices will appear between the 3rd and 4th column.
Separate columns will be added for each additional text. Details about an index
can be accessed by clicking on its label in the Description column.
Table 1: Indices in the Coh-Metrix 2.0 Output File
| No. | Description | Measure | Full description |
| 1 | Title | Title | Title |
| 2 | Genre | Genre | Genre |
| 3 | Source | Source | Source |
| 4 | JobCode | JobCode | JobCode |
| 5 | LSASpace | LSASpace | LSASpace |
| 6 | Date | Date | Date |
| 7 | CREFP1u | Anaphor reference, adjacent, unweighted | |
| 8 | CREFPau | Anaphor reference, all distances, unweighted | |
| 9 | CREFA1u | Argument Overlap, adjacent, unweighted | |
| 10 | CREFAau | Argument Overlap, all distances, unweighted | |
| 11 | CREFS1u | Stem Overlap, adjacent, unweighted | |
| 12 | CREFSau | Stem Overlap, all distances, unweighted | |
| 13 | CREFC1u | Proportion of content words that overlap between adjacent sentences | |
| 14 | LSAassa | LSA, Sentence to Sentence, adjacent, mean | |
| 15 | LSApssa | LSA, sentences, all combinations, mean | |
| 16 | LSAppa | LSA, Paragraph to Paragraph, mean | |
| 17 | DENPRPi | Personal pronoun incidence score | |
| 18 | DENSPR2 | Ratio of pronouns to noun phrases | |
| 19 | TYPTOKc | Type-token ratio for all content words | |
| 20 | CAUSVP | Incidence of causal verbs, links, and particles | |
| 21 | CAUSC | Ratio of causal particles to causal verbs (cp divided by cv+1) | |
| 22 | INTEi | Incidence of intentional actions, events, and particles | |
| 23 | INTEC | Ratio of intentional particles to intentional content | |
| 24 | STRUTa | Sentence syntax similarity, adjacent | |
| 25 | STRUTt | Sentence syntax similarity, all, across paragraphs | |
| 26 | STRUTp | Sentence syntax similarity, sentence all, within paragraphs | |
| 27 | TEMPta | Mean of tense and aspect repetition scores | |
| 28 | SPATC | Mean of location and motion ratio scores | |
| 29 | CONi | Incidence of all connectives | |
| 30 | DENCONDi | Number of conditional expressions, incidence score | |
| 31 | CONADpi | Incidence of positive additive connectives | |
| 32 | CONTPpi | Incidence of positive temporal connectives | |
| 33 | CONCSpi | Incidence of positive causal connectives | |
| 34 | CONLGpi | Incidence of positive logical connectives | |
| 35 | CONADni | Incidence of negative additive connectives | |
| 36 | CONTPni | Incidence of negative temporal connectives | |
| 37 | CONCSni | Incidence of negative causal connectives | |
| 38 | CONLGni | Incidence of negative logical connectives | |
| 39 | DENLOGi | Logical operator incidence score (and + if + or + cond + neg) | |
| 40 | FRQCRacw | Celex, raw, mean for content words (0-1,000,000) | |
| 41 | FRQCLacw | Celex, logarithm, mean for content words (0-6) | |
| 42 | FRQCRmcs | Celex, raw, minimum in sentence for content words (0-1,000,000) | |
| 43 | FRQCLmcs | Celex, logarithm, minimum in sentence for content words (0-6) | |
| 44 | WORDCacw | Concreteness, mean for content words | |
| 45 | WORDCmcs | Concreteness, minimum in sentence for content words | |
| 46 | HYNOUNaw | Mean hypernym values of nouns | |
| 47 | HYVERBaw | Mean hypernym values of verbs | |
| 48 | DENNEGi | Number of negations, incidence score | |
| 49 | DENSNP | Noun Phrase Incidence Score (per thousand words) | |
| 50 | SYNNP | Mean number of modifiers per noun-phrase | |
| 51 | SYNHw | Mean number of higher level constituents per word | |
| 52 | SYNLE | Mean number of words before the main verb of main clause in sentences | |
| 53 | READNW | Number of Words | |
| 54 | READNS | Number of Sentences | |
| 55 | READNP | Number of Paragraphs | |
| 56 | READASW | Average Syllables per Word | |
| 57 | READASL | Average Words per Sentence | |
| 58 | READAPL | Average Sentences per Paragraph | |
| 59 | READFRE | Flesch | |
| 60 | READFKGL | Flesch-Kincaid Grade Level (0-12) | |
The indices in Coh-Metrix 2.0 are categorized into six
groups: (1) general identification and reference information, (2) readability
indices, (3) general word and text information, (4) syntactic indices, (5)
referential and semantic indices, and (6) situational model dimensions.
1. General identification and reference information (indices 1-6)
Indices 1-6 are used for identification and reference purposes.
Title (index 01)
The title of the text is provided by the user (e.g., How plants grow).
The title is used for reference and identification purposes only.
Genre (index 02)
The genre is the general category of the text and is provided by the
user (e.g., science, narrative, or informational). The genre is used for
reference and identification purposes only.
Source (index 03)
The source identifies where the text appeared or was published, and is
provided by the user (e.g., IIS publications, 2005). The source is used for
reference and identification purposes only.
JobCode (index 04)
The Job ID (e.g., JohnExp1) identifies the job and is provided by the
user. The Job ID is used for both reference and identification purposes. It is
strongly suggested that researchers keep a record of the JobCode in order to
locate results in the future.
LSA Space (index 05)
The user has the option of choosing from five Latent Semantic Analysis (LSA)
spaces: College Level, Science, Narrative, Encyclopedia, and
Physics. The LSA space determines the world knowledge or conceptual domain that
is used for LSA comparisons. The College Level space is based on the TASA
(Touchstone Applied Science Associates, Inc.) corpus that has text files
covering novels, newspaper articles, and other information. Science, Narrative,
Encyclopedia and Physics spaces are based on large corpora of documents from
these genres. The choice of LSA space can have important consequences for LSA
scores. Users who do not feel confident about their choice are advised to select
the College Level LSA space.
Date (index 06)
The date is provided automatically and serves to help researchers
retrieve their data.
2. Readability indices
The traditional method of assessing texts on difficulty
consists of various readability formulas. More than 40 readability formulas have
been developed over the years (Klare, 1974-1975). The most common formulas are
the Flesch Reading Ease Score and the Flesch Kincaid Grade Level.
Flesch Reading Ease: READFRE (index 59)
The output of the Flesch Reading Ease formula is a number
from 0 to 100, with a higher score indicating easier reading. The average
document has a Flesch Reading Ease score between 60 and 70. The formula is
provided below:
READFRE = 206.835 - (1.015 x ASL) - (84.6 x ASW)
where:
ASL = average sentence length = the number of words divided by the number of
sentences. This is the same as READASL.
ASW = average number of syllables per word = the number of syllables divided by
the number of words, with syllable counts taken from the CELEX database. This is
the same as READASW.
Flesch-Kincaid Grade Level: READFKGL (index 60)
The more common Flesch-Kincaid Grade Level formula
converts the Reading Ease Score to a U.S. grade-school level. The higher the
number, the harder it is to read the text. The grade levels range from 0 to 12.
READFKGL = (0.39 x ASL) + (11.8 x ASW) - 15.59
In general, a text should have more than 200
words before the Flesch Reading Ease and Flesch-Kincaid Grade Level scores can
successfully be applied.
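The two formulas can be written out directly. The sketch below assumes that ASL and ASW have already been computed; the function names are illustrative, and no clipping to the 0-100 or 0-12 ranges is applied because the text does not say how Coh-Metrix handles out-of-range values.

```python
def flesch_reading_ease(asl, asw):
    """READFRE from average sentence length and average syllables per word."""
    return 206.835 - (1.015 * asl) - (84.6 * asw)


def flesch_kincaid_grade_level(asl, asw):
    """READFKGL from average sentence length and average syllables per word."""
    return (0.39 * asl) + (11.8 * asw) - 15.59


# Hypothetical text with 15 words per sentence and 1.5 syllables per word
print(round(flesch_reading_ease(15, 1.5), 2))
print(round(flesch_kincaid_grade_level(15, 1.5), 2))
```

Longer sentences and longer words both lower the Reading Ease score and raise the grade level, since the same two quantities enter both formulas with opposite signs.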
3. General Word and Text Information
The general word and text information includes incidence scores on word and
text units (i.e., number of occurrences per 1000 words). It also includes the
mean values of characteristics of content words, such as frequency of usage in
the English language and concreteness.
3.1 Basic count
Number of words: READNW (index 53)
This is the number of words in the entire text.
Number of sentences: READNS (index 54)
This is the number of sentences in the entire text.
Number of paragraphs: READNP (index 55)
This is the number of paragraphs in the entire text.
Paragraphs are delimited by hard returns, not indents. Some texts, for example,
have eight spaces to mark the beginning of a paragraph; however, unless these
spaces are preceded by a hard return, Coh-Metrix does not identify this as a
paragraph. Whenever two successive hard returns are entered, only one of
them will be counted as a paragraph, the one that has characters following it.
Lists of words also count as paragraphs if hard returns are used.
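One way to read this counting rule is that every run of characters bounded by hard returns is a paragraph, and that empty lines produced by successive hard returns are not counted. The sketch below follows that reading; it is an interpretation of the description, not the Coh-Metrix implementation, and the function name is hypothetical.

```python
def count_paragraphs(text):
    """Count hard-return-delimited lines that contain characters."""
    lines = text.replace("\r\n", "\n").split("\n")
    return sum(1 for line in lines if line.strip())


sample = "First paragraph.\n\nSecond paragraph.\nThird paragraph."
print(count_paragraphs(sample))
```

Under this reading, indentation inside a line never starts a new paragraph, and each item of a list separated by hard returns counts as its own paragraph.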
Syllables per word: READASW (index 56)
This is the mean number of syllables per content word, a ratio measure.
Words per sentence: READASL (index 57)
This is the mean number of words per sentence.
Sentences per paragraph: READAPL (index 58)
This is the mean number of sentences per paragraph.
3.2 Frequencies
Raw frequency of content words: FRQCRacw (index 40)
This is the mean raw frequency of all of the content words
in the text. Content words are nouns, adverbs, adjectives, main verbs, and other
categories with rich conceptual content.
Log frequency of content words: FRQCLacw (index 41)
This is the mean log frequency of all content words in the text.
Content words are nouns, adverbs, adjectives, main verbs, and other categories
with rich conceptual content. Taking the log of the frequencies rather than the
raw scores is compatible with research on reading time (Haberlandt &
Graesser, 1985; Just & Carpenter, 1980).
Min. raw frequency of content words: FRQCRmcs (index 42)
This initially computes the lowest frequency score among
all of the content words in each sentence. A mean of these minimum frequency
scores is then computed. Content words are nouns, adverbs, adjectives,
main verbs, and other categories with rich conceptual content. The word with the
lowest frequency score is the rarest word in the sentence. (Scores range from
0-1,000,000)
Log min. raw frequency of content words: FRQCLmcs (index 43)
This initially computes the lowest log frequency score
among all of the content words in each sentence. A mean of these minimum log
frequency scores is then computed. The logarithm is to the base 10. Content
words are nouns, adverbs, adjectives, main verbs, and other categories with rich
conceptual content. The word with the lowest log frequency score is the
rarest word in the sentence. (Scores range from 0-6)
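The two-step computation (minimum per sentence, then mean across sentences) can be sketched as below. The frequency table is a made-up stand-in for CELEX counts, and skipping words that are missing from the table or have zero frequency is an assumption, since the text does not describe how such words are treated.

```python
import math


def mean_min_log_frequency(sentences, frequencies):
    """Mean over sentences of the lowest log10 frequency among content words.

    sentences: list of lists of content words.
    frequencies: hypothetical word -> raw frequency table.
    """
    minima = []
    for words in sentences:
        logs = [math.log10(frequencies[w]) for w in words
                if frequencies.get(w, 0) > 0]
        if logs:  # assumption: sentences with no usable words are skipped
            minima.append(min(logs))
    if not minima:
        raise ValueError("no content word has a usable frequency")
    return sum(minima) / len(minima)


toy_frequencies = {"cell": 1000, "division": 100, "mitosis": 10}
toy_sentences = [["cell", "division"], ["cell", "mitosis"]]
print(mean_min_log_frequency(toy_sentences, toy_frequencies))
```

The raw-frequency variant (FRQCRmcs) follows the same pattern without the logarithm.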
Coh-Metrix 2.0 makes use of the MRC Psycholinguistics
Database (Coltheart, 1981), which scales samples of words on particular
characteristics. The MRC Psycholinguistics Database contains 150,837 words and
provides information on up to 26 different linguistic properties of these words.
Most MRC indices are based on psycholinguistic experiments conducted by
different researchers, so the coverage of words differs among the indices.
Coh-Metrix 2.0 uses the MRC concreteness ratings for a large sample of content
words. Concreteness measures how concrete a word is, based on human ratings.
High numbers lean toward concrete and low numbers toward abstract. Values vary
between 100 and 700.
Concreteness of content words: WORDCacw (index 44)
This is the mean concreteness value of all content words in a text that match a
word in the MRC database.
Minimum concreteness of content words: WORDCmcs (index 45)
For each sentence in the text, the content word that has the lowest
concreteness rating is identified. This score is the mean of these minimum
concreteness ratings across sentences.
3.4 Hypernymy
A word is abstract when it has few distinctive features and
few attributes that can be pictured in the mind. One way of measuring the
abstractness of a word is by the hypernym values in WordNet (Fellbaum, 1998;
Miller, et al., 1990).
WordNet is an online lexical reference system, the design
of which is inspired by current psycholinguistic theories of human lexical
memory. English nouns, verbs, adjectives and adverbs are organized into semantic
fields of underlying lexical concepts. Some sets of words are functionally
synonymous because they have the same or a very similar meaning. There are also
relations between synonym sets. In particular, a hypernym metric is the number
of levels in a conceptual taxonomic hierarchy above (superordinate to) a word.
For example, chair (as a seat) has 7 hypernym levels: seat -> furniture ->
furnishings -> instrumentality -> artifact -> object -> entity. A word having
more hypernym levels is more concrete. A word with fewer hypernym levels is more
abstract.
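The hypernym count amounts to walking up the hierarchy until the top is reached. The sketch below uses a hand-written parent table that contains only the chain from the chair example; it stands in for WordNet and is not a WordNet API.

```python
# Hypothetical single-parent hierarchy copied from the chair example above.
PARENT = {
    "chair": "seat",
    "seat": "furniture",
    "furniture": "furnishings",
    "furnishings": "instrumentality",
    "instrumentality": "artifact",
    "artifact": "object",
    "object": "entity",
}


def hypernym_levels(word, parent=PARENT):
    """Count the levels above a word in the toy hierarchy."""
    levels = 0
    while word in parent:
        word = parent[word]
        levels += 1
    return levels


def mean_hypernym_value(words, parent=PARENT):
    """Mean hypernym count over a list of words."""
    return sum(hypernym_levels(w, parent) for w in words) / len(words)


print(hypernym_levels("chair"))
```

In WordNet a word can have several senses and several hypernym paths; how Coh-Metrix chooses among them is not stated here, so the sketch assumes a single path per word.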
Noun hypernym: HYNOUNaw (index 46)
This is the mean hypernym value of nouns in the text.
Verb hypernym: HYVERBaw (index 47)
This is the mean hypernym value of main verbs in the text.
4. Syntactic indices
Syntactic indices (abbreviated as SYN) include a number of metrics that assess
syntactic complexity, syntactic composition, and the frequency of particular
syntactic classes or constituents in a text. Sentences with difficult
syntactic composition are structurally dense, are syntactically ambiguous, have
many embedded constituents, or are ungrammatical. The syntactic analyses are
based on the Charniak syntactic parser. There are over 50 parts of speech, which
are segregated into content and function words. When a word can be assigned to
more than one part of speech (POS) category, the most likely category is
assigned on the basis of its syntactic context. Moreover, the syntactic context
can assign the most likely POS for words it doesn't know. In addition to POS,
Coh-Metrix computes the number of noun-phrase (NP) constituents or number of
verb-phrase (VP) constituents per 1000 words.
Syntactic complexity is measured by Coh-Metrix
2.0 in several ways. First, there is the mean number of modifiers per
noun-phrase. A modifier is an optional element that describes the property of a
head of a phrase. Modifiers per NP refer to adjectives, adverbs, or determiners
that modify the head noun. For example, the noun-phrase the lovely, little girl
has three modifiers: the, lovely and little. A second metric is the mean number
of higher level constituents per sentence, controlling for number of words.
Sentences with difficult syntactic composition are structurally embedded and
have a higher incidence of verb-phrases after controlling for number of words.
A third metric is the number of words that appear before the main verb of the
main clause in the sentences of a text. Sentences that have many words before
the main verb are taxing on working memory.
4.1 Constituents
Noun phrase incidence: DENSNP (index 49)
This is the incidence of noun-phrase constituents per 1000 words.
Example:
Cell division occurs to reproduce and replace cells.
In this example, there are two main NPs: cell division and cells. There are
eight words, so the incidence score for this sentence is 250.
Modifiers per NP: SYNNP (index 50)
This is the mean number of modifiers per noun-phrase.
Higher level constituents: SYNHw (index 51)
This is the mean number of higher level constituents per word. Structurally
dense sentences tend to have more higher level syntactic constituents per word.
Words before main verb: SYNLE (index 52)
This is the mean number of words before the main verb of
the main clause in sentences. This is a good index of working memory load.
Negations: DENNEGi (index 48)
This is the incidence score for negation expressions.
4.2 Pronouns, Types, and Tokens
Personal pronoun: DENPRPi (index 17)
This is the number of personal pronouns per 1000 words. A high density of
pronouns can create referential cohesion problems if the reader does not know
what the pronouns refer to.
Example:
Paul told John that he wanted to help him out.
The words he and him in this sentence are both pronouns. With 2 pronouns in 10
words, the density score is 200. The pronouns, however, are ambiguous as we do
not know which pronoun refers to which person.
Pronoun ratio: DENSPR2 (index 18)
This is the ratio of the incidence of words classified as pronouns to the
incidence of noun-phrases in a text. A high density of pronouns compared with
the density of noun-phrases creates referential cohesion problems when the
reader does not know what the pronouns refer to.
Example:
The fourth stage of mitosis is called telophase, because telo- means "end", and
it begins when all the daughter chromosomes reach the two cell poles.
The word it is tagged as a pronoun, whereas phrases such as the fourth stage
are tagged as noun phrases. If there is one pronoun and 8 total noun-phrases
(the pronoun itself being a noun phrase) then the ratio would be 0.125.
Type-token ratio: TYPTOKc (index 19)
Type-token ratio (TTR) (Templin, 1957) is the number of unique words (called
types) divided by the number of tokens of these words. Each unique word in a
text is considered a word type. Each instance of a particular word is a token.
For example, if the word dog appears in the text 7 times, its type value is 1,
whereas its token value is 7. When the type-token ratio approaches 1, each word
occurs only once in the text; comprehension should be comparatively difficult
because many unique words need to be decoded and integrated with the discourse
context. As the type-token ratio decreases, words are repeated many times in
the text, which should increase the ease and speed of text processing.
Type-token ratios are computed for content words, but not function words. TTR
scores are most valuable when texts of similar lengths are compared.
Example:
Cytokinesis, the second stage of cell division, begins to occur before mitosis
is complete (usually during telophase) and continues after the nuclei of the
daughter cells are completely formed. The preliminary steps of cytokinesis occur
during the growth interphases (called the G phases) of the cell cycle.
In these sentences (taken from the text reprinted later in
the help facility), the TTR for content words is 0.933. Words such as stage
occur only once, but words like cytokinesis and cell appear more than once.
Coh-Metrix uses lexeme versions in its calculation rather than lemma or stem
versions; for example, cell is considered different from cells.
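A minimal sketch of the type-token computation is given below. It assumes the content words have already been extracted by a tagger, and it lowercases the words before comparing them, which is an assumption the text does not confirm. Word forms are otherwise compared as they appear, so cell and cells are different types.

```python
def type_token_ratio(content_words):
    """Number of unique word forms divided by the number of tokens."""
    tokens = [word.lower() for word in content_words]
    if not tokens:
        raise ValueError("at least one content word is required")
    return len(set(tokens)) / len(tokens)


print(type_token_ratio(["dog", "cat", "dog", "cell", "cells"]))
```

Because longer texts tend to repeat more words, the ratio usually falls as text length grows, which is why comparisons are most meaningful between texts of similar length.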
4.4 Connectives
Many strands of cohesion are potentially important and
recognized in linguistics, discourse processing, psychology, education, rhetoric
and other fields that analyze text. Connectives are one important class of
signaling devices for particular categories of cohesion relations in text
(Halliday & Hasan, 1976). In dialogue, several classes of discourse markers
help connect the thread of conversation (Louwerse & Mitchell, 2003). The insertion of
connectives is known to have a substantial impact on comprehension and memory
for text (McNamara, Kintsch, E., Songer, & Kintsch, W., 1996).
Connectives are classified on two dimensions in Coh-Metrix
2.0. On one dimension, the
extension of the situation described by the text is determined.
Positive connectives extend events, whereas negative
connectives cease to extend the expected events (Louwerse,
2002; Sanders, Spooren & Noordman, 1992). Negative relations are synonymous
with adversative relations, as defined in Halliday and Hasan (1976).
Positive (p): and, after, because
Negative (n): but, until, although
On another dimension, there are connectives associated with the type of
cohesion, namely additive, temporal, logical, and causal. Examples are given
below.
Additives (AD): also, moreover, however, but
Causal (CA): because, so, consequently, although, nevertheless
Logical (LG): or, actually, if
Temporal (TP): after, before, when, until
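The two-dimensional classification can be sketched with a small lookup table. The table below contains only the five words that appear in both example lists above (a polarity and a type), so it is a toy stand-in for the full Coh-Metrix connective lists, which are not reproduced in this document. The function name is hypothetical.

```python
from collections import Counter

# Toy lexicon: word -> (polarity, type), taken from the example lists above.
CONNECTIVES = {
    "after": ("p", "TP"),
    "because": ("p", "CA"),
    "but": ("n", "AD"),
    "until": ("n", "TP"),
    "although": ("n", "CA"),
}


def connective_incidence(words, lexicon=CONNECTIVES):
    """Incidence per 1000 words for each (polarity, type) category."""
    if not words:
        raise ValueError("the text has no words")
    counts = Counter(lexicon[w.lower()] for w in words if w.lower() in lexicon)
    return {category: 1000.0 * n / len(words) for category, n in counts.items()}


text = "she left because it rained but he stayed until noon"
print(connective_incidence(text.split()))
```

Multi-word connectives and words that are connectives only in some contexts would need more than a word lookup; the sketch ignores both.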
All connectives: CONi (index 29)
This is the incidence of all connectives.
Conditional operator: DENCONDi (index 30)
This is the incidence score for conditional expressions.
Positive additive connectives: CONADpi (index 31)
This is the incidence of positive additive connectives.
Positive temporal connectives: CONTPpi (index 32)
This is the incidence of positive temporal connectives.
Positive causal connectives: CONCSpi (index 33)
This is the incidence of positive causal connectives.
Positive logical connectives: CONLGpi (index 34)
This is the incidence of positive logical connectives.
Negative additive connectives: CONADni (index 35)
This is the incidence of negative additive connectives.
Negative temporal connectives: CONTPni (index 36)
This is the incidence of negative temporal connectives.
Negative causal connectives: CONCSni (index 37)
This is the incidence of negative causal connectives.
Negative logical connectives: CONLGni (index 38)
This is the incidence of negative logical connectives.
4.5 Logical Operators
Logical operators are prevalent in syllogisms and texts that express logical
reasoning. They include the Boolean operators (and, or, not, if, then) and a
small number of other similar cognate terms. Texts with a high density of these
logical operators are difficult for most readers.
Logical operators: DENLOGi (index 39)
This is the incidence of logical operators. Along with "and", "or", and
negations, a number of conditionals are also included.
4.6. Sentence syntax similarity
The sentence syntax similarity indices compare the syntactic tree structures of
sentences. The algorithms build an intersection tree between two syntactic
trees, one for each of the two sentences being compared. An index of syntactic
similarity between two sentences is the proportion of nodes in the two tree
structures that are intersecting nodes.
Syntactic structure similarity adjacent: STRUTa (index 24)
This is the proportion of intersection tree nodes between all adjacent
sentences.
Syntactic structure similarity all 01: STRUTt (index 25)
This is the proportion of intersection tree nodes between all sentences and
across paragraphs.
Syntactic structure similarity all 02: STRUTp (index 26)
This is the proportion of intersection tree nodes between all sentences, but
within paragraphs.
5. Referential and Semantic Indices
Referential cohesion occurs when
a noun, pronoun, or noun phrase refers to another constituent in the text. For
example, consider the sentence When water is heated, it boils. The word it refers to the
word water. A
referring expression (N) is the noun, pronoun, or noun-phrase that refers to
another constituent (C). C is designated as the referent of N. In the example
sentence above, the word it is the referring expression N, whereas the referent
C is the word water. In most cases, the referent C occurs in the text
prior to the referring expression N; this is known as anaphora. However, the
referring expression can also precede the referent constituent; this is known as
cataphora. An example of cataphora would be When it is heated,
water boils.
Referential cohesion has been
extensively investigated in the fields of text linguistics and discourse
processes, especially in the form of argument overlap (Kintsch & Van Dijk,
1978). Argument overlap occurs when a noun, pronoun, or noun-phrase in one
sentence is a coreferent of a noun, pronoun, or noun-phrase in another sentence.
The word argument is used in a special sense in this context, namely in contrast
to predicates in propositional representations (see Kintsch & Van Dijk,
1978). In this early work, two sentences were regarded as being linked by
coreference if they shared a common argument (i.e., an overlapping noun,
pronoun, or noun-phrase). However, the early theory was eventually expanded to
allow referential overlap between a {noun | pronoun | noun-phrase} N and a
referential proposition that has a similar morphological stem as a noun in N.
For example, consider the two sentences When water is heated, it
boils and eventually evaporates. When the heat is reduced, it turns back into a
liquid form. The heat in the
second sentence corefers to the proposition Water is heated; heat and heated share the
same morphological stem heat, even though one is a noun and the other is a
verb.
In addition to referential indices, there are indices that assess the extent to
which the content of sentences or paragraphs is similar semantically or
conceptually. Coherence is predicted to increase as a function of similarity.
One index of semantic similarity is content word overlap, which is the
proportion of content words in two excerpts that are common to both excerpts.
Another method of computing similarity is through Latent Semantic Analysis
(LSA). LSA is a mathematical, statistical technique for representing world
knowledge, based on a large corpus of texts. LSA uses singular value
decomposition, a general form of principal component analysis, to condense a
very large corpus of texts to 100-500 dimensions (Deerwester, Dumais, Furnas,
Landauer & Harshman, 1990; Landauer & Dumais, 1997; Landauer, Foltz, & Laham,
1998). The conceptual similarity between any two text excerpts (e.g., word,
clause, sentence, text) is evaluated by these 100-500 functional dimensions.
There are many statistical metrics for computing similarity other than LSA (see
Landauer, McNamara, Dennis, & Kintsch, in press; Millis, Kim, Todaro, Magliano,
Wiemer-Hastings, & McNamara, 2004).
5.1 Anaphor
Adjacent anaphor reference: CREFP1u (index 7)
This is the proportion of anaphor references between adjacent sentences.
Example:
There are four distinct phases of mitosis called prophase, metaphase, anaphase,
and telophase. These four phases are well known to researchers who can easily
observe them with, for example, the simple light microscope.
In this example, the pronoun them refers to phases in the previous sentence.
Anaphor reference: CREFPau (index 8)
This is the proportion of unweighted anaphor references that refer back to a
constituent up to 5 sentences earlier.
5.2 Coreference
Coh-Metrix currently considers three forms of coreference between sentences.
For any two sentences s1 and s2, if there exists a noun common to both, then
the two sentences are considered noun overlapped. If there exist two nouns, one
from s1 and the other from s2, that share a common stem, then the two sentences
are considered noun stem overlapped. If a noun from one sentence, s1, has a
stem that is shared by any category of word in the other sentence, s2, then the
two sentences are stem overlapped.
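The three definitions can be sketched as set operations. The sketch assumes each sentence is a list of (word, tag) pairs with nouns tagged "N", uses a hand-written stem table in place of a real stemmer, and checks stem overlap in both directions; all of these are illustrative assumptions, not the Coh-Metrix implementation.

```python
# Hypothetical stem table; a real system would use a stemmer.
STEMS = {"heated": "heat", "boils": "boil", "cells": "cell"}


def stem(word):
    return STEMS.get(word, word)


def coreference_overlaps(s1, s2):
    """Return the three overlap flags for two tagged sentences."""
    nouns1 = {w for w, tag in s1 if tag == "N"}
    nouns2 = {w for w, tag in s2 if tag == "N"}
    noun_stems1 = {stem(w) for w in nouns1}
    noun_stems2 = {stem(w) for w in nouns2}
    all_stems1 = {stem(w) for w, _ in s1}
    all_stems2 = {stem(w) for w, _ in s2}
    return {
        "noun": bool(nouns1 & nouns2),
        "noun_stem": bool(noun_stems1 & noun_stems2),
        # a noun stem in one sentence matched by any word in the other
        "stem": bool((noun_stems1 & all_stems2) | (noun_stems2 & all_stems1)),
    }


s1 = [("water", "N"), ("is", "V"), ("heated", "V"), ("boils", "V")]
s2 = [("heat", "N"), ("is", "V"), ("reduced", "V"), ("form", "N")]
print(coreference_overlaps(s1, s2))
```

With the heat/heated pair, only the stem overlap flag is set: the noun heat in one sentence shares a stem with the verb heated in the other, but no noun is common to both.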
Adjacent
argument overlap: CREFA1u (index 9)
This is the proportion of
adjacent sentences that share one or more arguments (i.e., noun, pronoun,
noun-phrase).
Example:
Cell division
occurs to reproduce and replace cells. The division
of cells with a membrane-bound nucleus and
organelles (eucaryotic cells) involves two distinct but overlapping stages,
mitosis and cytokinesis
.
In this example, the word cells overlaps
between two adjacent sentences. This excerpt is part of the sample text printed
at the bottom of this help facility.
Argument
overlap:
CREFAau
(index
10)
This is the proportion of all
sentence pairs in a paragraph that share one or more arguments (i.e., noun,
pronoun, noun-phrase).
Adjacent stem
overlap: CREFS1u (index 11)
This is the proportion of adjacent sentences that share one
or more word stems.
Example:
The division of cells with a membrane-bound nucleus and
organelles (eucaryotic cells) involves two distinct but overlapping stages,
mitosis and cytokinesis. Mitosis occurs to replicate the cell's genetic material
in the nucleus, whereas cytokinesis occurs to divide
the gel-like liquid surrounding the cell's nucleus, called cytoplasm.
In this example, taken from the
sample text printed at the bottom of this help facility, the word division has a stem
overlap with divide.
Stem overlap: CREFSau (index 12)
This is the proportion of all
sentence pairs in a paragraph that share one or more word stems. When the text
is extremely long, it is not possible to compute this measure because the
computation requires a large amount of processing time.
Content word overlap: CREFC1u (index 13)
This is the proportion of
content words that are shared between adjacent sentences.
5.3 Latent Semantic Analysis (LSA)
Latent Semantic Analysis (LSA)
is a mathematical, statistical technique for representing world knowledge, based
on a large corpus of texts. LSA uses singular value decomposition, a general
form of principal component analysis, to condense a very large corpus of texts
to 100-500 dimensions (Deerwester, Dumais, Furnas, Landauer & Harshman,
1990; Landauer & Dumais, 1997; Landauer, Foltz, & Laham, 1998). The
conceptual similarity between any two text excerpts (e.g., word, clause,
sentence, text) is evaluated by these 100-500 functional dimensions. In
Coh-Metrix, therefore, the sentences, paragraphs and entire texts are
represented by LSA vectors of 100-500 dimensions. The cosine of the angle between
vectors is used to measure the similarity between excerpts. Text cohesion is
assumed to increase as a function of higher cosine scores between text
constituents.
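For reference, the cosine between two LSA vectors can be written out as follows. The symbols u, v and d are notation introduced here for illustration (d would be the number of retained dimensions, 100-500 in the description above); they do not come from the Coh-Metrix documentation.

```latex
\cos(\mathbf{u}, \mathbf{v})
  = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
  = \frac{\sum_{i=1}^{d} u_i v_i}
         {\sqrt{\sum_{i=1}^{d} u_i^{2}} \, \sqrt{\sum_{i=1}^{d} v_i^{2}}}
```

The value is 1 when the two vectors point in the same direction and 0 when they are orthogonal, so higher values correspond to greater conceptual similarity.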
There are several methods of
computing LSA cohesion, as specified below. Both the means and standard
deviations may be computed when several pairs of cosine similarity scores are
part of the computation. Coh-Metrix 2.0 reports means but not standard
deviations.
LSA sentence adjacent: LSAassa (index 14)
This index computes mean LSA cosines for adjacent,
sentence-to-sentence (abbreviated as "ass") units.
This measures how conceptually similar each sentence is to the next sentence.
Example:
Text 1: The
field was full of lush, green grass. The horses grazed peacefully. The young
children played with kites. The women occasionally looked up, but only
occasionally. A warm summer breeze blew and everyone, for once, was almost
happy.
Text 2: The
field was full of lush, green grass. An elephant is a large animal. No-one
appreciates being lied to. What are we going to have for dinner tonight?
In the example texts printed above, Text 1 records much higher LSA scores than Text
2. The words in Text 1 tend to be thematically related to a pleasant day in an
idyllic park scene: green, grass, children, playing, summer,
breeze, kites, and happy. In
contrast, the sentences in Text 2 tend to be unrelated.
LSA sentence all: LSApssa (index 15)
Like LSA sentence adjacent (LSAassa), this index computes
mean LSA cosines.
However, for this index all sentence combinations are considered, not
just adjacent sentences. LSApssa computes
how conceptually similar each sentence is to every other
sentence in the text.
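The two sentence-level LSA indices differ only in which sentence pairs are averaged. The following Python sketch illustrates this under the assumption that each sentence has already been mapped to an LSA vector; the tiny two-dimensional vectors and the function names are hypothetical, and building the LSA space itself is not shown.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def mean_cosine(pairs):
    """Mean cosine over an iterable of vector pairs (0.0 if there are none)."""
    scores = [cosine(u, v) for u, v in pairs]
    return sum(scores) / len(scores) if scores else 0.0


def lsa_sentence_adjacent(vectors):
    """Mean cosine between each sentence and the next one."""
    return mean_cosine(zip(vectors, vectors[1:]))


def lsa_sentence_all(vectors):
    """Mean cosine over all sentence combinations."""
    return mean_cosine(combinations(vectors, 2))


# Hypothetical sentence vectors (real LSA vectors have 100-500 dimensions).
vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(lsa_sentence_adjacent(vectors))
print(lsa_sentence_all(vectors))
```

With these toy vectors the first two sentences are identical in meaning and the third is unrelated, so the adjacent mean is higher than the all-pairs mean.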
LSA paragraph: LSAppa (index 16)
This index computes LSA cosines for paragraph-to-paragraph (pp)
units.
It measures how similar each paragraph is to the other paragraphs in the text. This
measure cannot be computed for texts with only one paragraph or for texts that
ignore paragraph junctures.
Many aspects of a text can contribute to the
situation model (or mental model), the referential content or microworld
of what a text is about (Graesser, Millis, & Zwaan, 1997; Kintsch, 1998; van
Dijk & Kintsch, 1983). Text comprehension researchers have investigated at
least five situational dimensions (Zwaan & Radvansky, 1998): causation,
intentionality, time, space and protagonists. All of these situational
dimensions can be indicated in a text by connectives, particles, nouns and
verbs. In Coh-Metrix 2.0, the protagonist dimension is not analyzed.
6.1 Causal dimension
Causal cohesion reflects the
extent to which sentences are related by causal cohesion relations. Causal
cohesion relations are appropriate when the text refers to events and actions
that are related causally, as in the case of science texts with causal
mechanisms and stories with an action plot. Causality is not relevant, for
example, in texts that describe static scenes and texts that convey abstract
logical arguments.
Coh-Metrix 2.0 needs to first
estimate how much of the text refers to events and actions that may be part of
causal content. This is accomplished by counting the number of main verbs that
are causal, based on WordNet (Fellbaum, 1998; Miller, Beckwith, Fellbaum,
Gross, & Miller, 1990). WordNet is a lexical database
that contains a large number of semantic characteristics of words. A verb is
classified as "causal" if it belongs to certain WordNet categories. The higher
the incidence of causal verbs in a text, the more the text is assumed to convey
causal content.
Having causal verbs in a
text does not ensure that the reader can connect these events and actions with
causal relations. According to Coh-Metrix 2.0, causal cohesion relations are
signaled by causal particles. Some causal particles are
conjunctions, transitional adverbs, and other forms of connectives, such as since, so that, because, and consequently. These
particles are used to indicate some causal relationship between clauses that
refer to events and actions. Other causal particles consist of a small number of
verbs that explicitly assert there is a causal relationship between
constituents, without specifying the nature of the causal content: cause, enable, make. It should be
noted that "and" and "or" are not classified as causal particles.
Causal content: CAUSVP (index 20)
This is the incidence of causal verbs and causal particles
in text.
Causal cohesion: CAUSC (index 21)
This is a ratio of causal particles (P) to causal verbs
(V). The denominator is incremented by the value of 1 to handle the rare case
when there are 0 causal verbs in the text. Cohesion suffers when the text has
many causal verbs (signifying events and actions) but few causal particles that
signal how the events and actions are connected.
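The ratio described above can be written as a one-line function. This is a minimal sketch: the function name is hypothetical, and whether Coh-Metrix feeds raw counts or incidence scores into the ratio is not stated in this section, so raw counts are assumed here.

```python
def causal_cohesion(num_particles, num_verbs):
    """Ratio of causal particles (P) to causal verbs (V), with 1 added to the
    denominator so that a text with 0 causal verbs does not divide by zero."""
    return num_particles / (num_verbs + 1)


# Many causal verbs but few particles gives a low score.
print(causal_cohesion(3, 5))
# A text without causal verbs is still defined.
print(causal_cohesion(2, 0))
```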
6.2 Intentional dimension
Intentional cohesion reflects
the extent to which sentences are related by intentional cohesion relations.
Intentional cohesion relations are appropriate when the text refers to animate
protagonists who perform actions in pursuit of goals, as in the case of simple
stories and other forms of narrative (Singer & Halldorson, 1996; Van den
Broek & Trabasso, 1986). Intentionality is not relevant, for example, in
texts that describe events that are not goal directed and not executed by
animate agents (e.g., mechanisms that cause volcanoes).
In Coh-Metrix 2.0, the incidence
of intentional actions and events is estimated by counting the number of main
verbs that are intentional, based on WordNet (Fellbaum, 1998;
Miller, Beckwith, Fellbaum, Gross, & Miller, 1990), and
that are performed by animate subject nouns (according to WordNet).
WordNet is a lexical database that contains a large number
of semantic characteristics of words. A verb is classified as "intentional" if
it belongs to particular WordNet categories. The higher the incidence of
intentional actions in a text, the more the text is assumed to convey
goal-driven content.
Intentional cohesion is the
ratio of intentional particles (e.g., in order to, so that, for
the purpose of, by means of, by, wanted to) to the incidence of intentional actions and events.
Intentional content: INTEi (index 22)
This is the incidence of intentional actions, events, and
particles (per thousand words).
Intentional cohesion: INTEC (index 23)
This is the ratio of intentional particles to intentional
actions/events.
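The two intentional indices can be illustrated with a small Python sketch. The function names and the sample counts are hypothetical. Returning None when there are no intentional actions is an assumption made here; this section does not say how that case is handled for INTEC.

```python
def incidence_per_thousand(count, total_words):
    """Number of occurrences per thousand words of text."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 1000.0 * count / total_words


def intentional_content(num_actions, num_particles, total_words):
    """Incidence of intentional actions/events and particles together."""
    return incidence_per_thousand(num_actions + num_particles, total_words)


def intentional_cohesion(num_particles, num_actions):
    """Ratio of intentional particles to intentional actions/events."""
    if num_actions == 0:
        return None  # assumption: undefined without intentional actions
    return num_particles / num_actions


# Hypothetical counts for a 500-word story.
print(intentional_content(12, 3, 500))
print(intentional_cohesion(3, 12))
```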
6.3 Temporal dimension
Temporal cohesion reflects the
extent to which sentences are related by temporal cohesion relations. Temporal
cohesion relations are appropriate when the text refers to events or
actions. The
actions and events may be articulated in different tenses (past, present,
future) and different aspects (e.g., in progress, completed, vs. static). For example,
X died is in the past tense and completed, X is dying is present and in progress, X
is dead is static, whereas X will have died is future and completed. Temporal cohesion is measured by the
repetition scores when analyzing the sequence of verbs that are classified in a
particular tense and aspect.
Temporal cohesion: TEMPta (index 27)
This is the repetition score for tense and
aspect. The repetition score for tense is averaged
with the repetition score for aspect.
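This section does not define how a repetition score is computed, so the following Python sketch rests on a loudly stated assumption: the repetition score is taken to be the proportion of consecutive verbs that carry the same label. Only the final averaging of the tense score and the aspect score comes from the description above; the function names and sample verbs are hypothetical.

```python
def repetition_score(labels):
    """ASSUMED definition: proportion of consecutive items with the same label."""
    pairs = list(zip(labels, labels[1:]))
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if a == b) / len(pairs)


def temporal_cohesion(verbs):
    """Average of the tense repetition score and the aspect repetition score.

    verbs is a sequence of (tense, aspect) labels in text order.
    """
    tenses = [tense for tense, _ in verbs]
    aspects = [aspect for _, aspect in verbs]
    return (repetition_score(tenses) + repetition_score(aspects)) / 2


# Hypothetical tense/aspect labels for four verbs in a text.
verbs = [
    ("past", "completed"),
    ("past", "completed"),
    ("present", "in progress"),
    ("present", "completed"),
]
print(temporal_cohesion(verbs))
```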
6.4 Spatial dimension
Spatial cohesion reflects the
extent to which the text has spatial content and the sentences are related by
spatial particles or relations. Spatial content includes location nouns and
prepositions, as well as motion actions and prepositions, based on WordNet
(Fellbaum, 1998; Miller, Beckwith, Fellbaum, Gross, & Miller, 1990).
WordNet is a lexical database
that contains a large number of semantic characteristics of words.
Location nouns are defined by WordNet as locational, such
as Memphis, place, Central
Park. Location prepositions
(in, by, near), deictic references (here, there), and other particles play a role in relating the nouns in
space. Motion
verbs (go, run, drive) are related by particles that refer to spatial and
deictic indexes (from, to, through, by, between, here,
there).
In Coh-Metrix 2.0, the location
ratio score is the incidence of location prepositions (LSP) divided by LSP plus
the incidence of location nouns. The motion ratio score is the incidence of
motion particles (MSP) divided by MSP plus the incidence of motion verbs.
Spatial cohesion is the average of these two ratios.
Spatial cohesion: SPATC (index 28)
Mean of location and motion ratio scores. The location ratio score is the incidence of location prepositions (LSP) divided by LSP plus the incidence of location nouns. The motion ratio score is the incidence of motion particles (MSP) divided by MSP plus the incidence of motion verbs.
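The arithmetic of the spatial cohesion score can be sketched in Python as follows. The sketch reads each ratio as particles divided by the sum of particles and content words, by analogy between the location and motion definitions. The function names and sample incidences are hypothetical, and returning 0.0 for a ratio with an empty denominator is an assumption.

```python
def particle_ratio(particle_incidence, content_incidence):
    """Particles divided by particles plus content words (nouns or verbs)."""
    total = particle_incidence + content_incidence
    if total == 0:
        return 0.0  # assumption: no spatial material gives a ratio of 0
    return particle_incidence / total


def spatial_cohesion(lsp, location_nouns, msp, motion_verbs):
    """Mean of the location ratio score and the motion ratio score."""
    location_ratio = particle_ratio(lsp, location_nouns)
    motion_ratio = particle_ratio(msp, motion_verbs)
    return (location_ratio + motion_ratio) / 2


# Hypothetical incidence scores: the location ratio is 6 / 10 and the
# motion ratio is 2 / 8, so the mean lies between the two.
print(spatial_cohesion(lsp=6, location_nouns=4, msp=2, motion_verbs=6))
```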
IV. References
Herdan, G. (1960). Type
token mathematics: A textbook of mathematical linguistics. Gravenhage, Mouton.
Baayen, R. H., R. Piepenbrock, and H. van Rijn (Eds.)
(1993). The CELEX Lexical Database (CD-ROM). University of Pennsylvania, Philadelphia (PA):
Linguistic Data Consortium.
Brill, E. (1992). A simple
rule-based part of speech tagger. In Proceedings of the Third
Conference on Applied Natural Language Processing, ACL.
Coltheart, M. (1981). The MRC psycholinguistic database. Quarterly Journal of Experimental Psychology, 33A, 497-505.
Deerwester, S., Dumais, S. T.,
Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent
semantic analysis. Journal of the American Society for
Information Science, 41,
391-407.
Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.
Graesser, A. C., McNamara, D. S., & Louwerse, M. M. (2003). What do readers need to learn in order to process coherence relations in narrative and expository text. In A. P. Sweet and C. E. Snow (Eds.), Rethinking reading comprehension (pp. 82-98). New York: Guilford Publications.
Graesser, A. C., McNamara, D. S., Louwerse, M. M., &
Cai, Z. (2004).
Coh-Metrix: Analysis of text on cohesion and language. Behavior
Research Methods, Instruments, & Computers, 36, 193-202.
Graesser, A.C., Millis, K.K., & Zwaan, R.A. (1997). Discourse comprehension. In J.T. Spence, J.M. Darley, and D.J. Foss (Eds.), Annual Review of Psychology, Vol. 48. Palo Alto, CA: Annual Reviews Inc.
Haberlandt, K., & Graesser,
A. C. (1985).
Component processes in text comprehension and
some of their interactions.
Journal of Experimental Psychology: General, 114,
357-374.
Halliday, M. A. K., & Hasan,
R. (1976). Cohesion in English. London: Longman.
Just, M. A. & Carpenter, P. A. (1980). A theory of reading: From eye fixations to comprehension. Psychological Review, 87, 329-354.
Kintsch, W. (1998). Comprehension: A paradigm for cognition. Cambridge: Cambridge University Press.
Kintsch, W., & Van Dijk, T.
A. (1978). Toward a model of text comprehension and production. Psychological
Review, 85, 363-394.
Klare, G. R. (1974-1975).
Assessing readability. Reading Research Quarterly, 10, 62-102.
Knott, A. & Dale, R. (1994). Using linguistic phenomena to motivate a set of rhetorical relations. Discourse Processes, 18, 35-62.
Landauer, T. K., & Dumais,
S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory
of the acquisition, induction, and representation of knowledge. Psychological
Review, 104, 211-240.
Landauer, T. K., Foltz, P. W.,
& Laham, D. (1998). An introduction to latent semantic analysis. Discourse
Processes, 25, 259-284.
Landauer, T., McNamara, D.,
Dennis, S., & Kintsch, W. (Eds.). (in press). LSA: A road to
meaning.
Mahwah, NJ: Erlbaum.
Louwerse, M.M. (2002). An analytic and cognitive parameterization of coherence relations. Cognitive Linguistics, 291?15.
Louwerse, M.M., & Mitchell,
H.H. (2003).
Toward a taxonomy of a set of discourse markers in dialog: A theoretical
and computational linguistic account.
Discourse Processes, 35, 199-239.
Martin, J. R. (1992). English
text: System and structure. Amsterdam: Benjamins.
McNamara, D.S., Kintsch, E., Songer, N.B., & Kintsch, W. (1996). Are good texts always better? Text coherence, background knowledge, and levels of understanding in learning from text. Cognition and Instruction, 14, 1-43.
McNamara, D. S., Louwerse, M. M.,
& Graesser, A. C. (2002). Coh-Metrix: Automated
cohesion and coherence scores to predict text readability and facilitate
comprehension. Institute
for Intelligent Systems, University of Memphis, Memphis, TN.
Miller, G. A., Beckwith, R.,
Fellbaum, C., Gross, D., & Miller, K. (1990). Five papers on WordNet. Cognitive Science
Laboratory, Princeton University, No. 43.
Millis, K. K.,
Kim, H. J., Todaro, S., Magliano, J., Wiemer-Hastings, K., & McNamara, D. S.
(2004). Identifying reading strategies using latent semantic analysis: Comparing
semantic benchmarks. Behavior Research Methods, Instruments, & Computers,
36, 213-221.
Myers, J. L., & O'Brien, E. J.
(1998). Accessing the discourse representation during reading. Discourse
Processes, 26, 131-157.
Sanders, T. J. M., Spooren, W.
P. M., & Noordman, L. G. M. (1992). Toward a taxonomy of coherence
relations. Discourse Processes, 15, 1-35.
Sanders, T. J. M., Spooren, W.
P. M., & Noordman, L. G. M. (1993). Coherence relations in a cognitive
theory of discourse representation. Cognitive Linguistics, 4, 93-133.
Sekine, S., & Grishman, R.
(1995). A corpus-based probabilistic grammar with only two
non-terminals. In the
Fourth International Workshop on Parsing Technology.
Singer, M., & Halldorson, M. (1996). Constructing and validating motive bridging inferences. Cognitive Psychology, 30, 1-38.
Templin, M. C. (1957). Certain
language skills in children, their development and interrelationships.
Van den Broek, P., &
Trabasso, T. (1986). Causal networks versus goal hierarchies in summarizing
texts. Discourse
Processes, 9, 1-15.
Van Dijk, T.A., & Kintsch,
W. (1983). Strategies
of discourse comprehension.
Zwaan, R. A., & Radvansky,
G. A. (1998). Situation models in language comprehension and memory.
Psychological Bulletin, 123,
162-185.
Zwaan, R. A., Langston, M. C., & Graesser, A. C. (1995).
The construction of situation
models in narrative comprehension: An event-indexing model. Psychological
Science, 6,
292-297.