Coh-Metrix automatically determines how text elements and constituents are
connected for specific types of cohesion. Suppose there are 2 x 2 x 5 = 20
types of cohesion (local or global; vocabulary- or grammar-driven; and referential,
locational, temporal, causal, or structural). Suppose further that there are N
elements and constituents in a particular text. There would be N x (N-1) directional
cohesion connections and [N x (N-1)]/2 bidirectional connections with respect
to any one of the 20 cohesion relations. We can capture the resulting set
of connections in the form of a matrix (Graesser, Karnavat et al., 2000; Kintsch,
1998; Zwaan et al., 1995). A cell in the matrix is 0 if there is no relation
of that type between a pair of elements/constituents, 1 if there is a solid
relation, and an intermediate value if the relation is measured on a non-discrete
metric. The full matrix holds the entire set of connections with respect to a
type of relation R, whereas the contiguous sub-matrix includes only those pairs
of elements/constituents that are adjacent in the explicit text. We can define
a density measure as the sum of such cell values divided by the number of cells.
For example, the causal cohesion density would be computed
as:
$$\frac{\sum_{i \neq j} R(e_i, e_j)}{N \times (N-1)}$$
which is the average cell value over all ordered pairs of elements/constituents with respect to the causal relation. One could restrict this to a contiguous, bidirectional, causal cohesion density, which considers only the contiguous elements and constituents. Because N elements in sequence form N-1 contiguous pairs, it is computed as
$$\frac{\sum R(e_i, e_j \mid e_i \text{ and } e_j \text{ are contiguous})}{N-1}$$
Integrating over all types of cohesion markers yields an overall cohesion
density score for a particular text. More importantly, however, the individual
density scores specify how densely an entire text is marked for cohesion
with respect to a particular type of relation. This matters because different
readers (low vs. high knowledge, low vs. high reading ability) might rely
more on one type of cohesion relation than on another.
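The density measures described above can be sketched in a few lines of Python. This is a minimal illustration rather than the Coh-Metrix implementation: the function names are hypothetical, each relation matrix is assumed to be a square list of lists with cell values between 0 and 1, the contiguous version averages the two directions of each adjacent pair and divides by the number of adjacent pairs (N-1 for N elements in sequence), and the overall score is assumed to be an unweighted mean across relation types.

```python
def _check_square(R):
    n = len(R)
    if n < 2 or any(len(row) != n for row in R):
        raise ValueError("R must be a square matrix with N >= 2")
    return n


def cohesion_density(R):
    """Average off-diagonal cell value: sum of R(e_i, e_j), i != j, over N(N-1)."""
    n = _check_square(R)
    total = sum(R[i][j] for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


def contiguous_density(R):
    """Average over the N-1 adjacent pairs (e_i, e_i+1).

    Assumption: a bidirectional value is the mean of the two directed cells.
    """
    n = _check_square(R)
    total = sum((R[i][i + 1] + R[i + 1][i]) / 2 for i in range(n - 1))
    return total / (n - 1)


def overall_density(matrices):
    """Unweighted mean of the full-matrix densities (an assumed way to integrate)."""
    return sum(cohesion_density(R) for R in matrices.values()) / len(matrices)


# Toy data for a text with N = 4 elements (values are invented for illustration).
causal = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0.5],
    [0, 0, 0.5, 0],
]
referential = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
]

print(cohesion_density(causal))
print(contiguous_density(causal))
print(overall_density({"causal": causal, "referential": referential}))
```

In this toy text the causal links sit mostly between neighboring elements, so the contiguous density is higher than the full-matrix density. Comparing such scores across relation types is what would show which kind of cohesion a text marks most heavily.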
A fine-grained recall analysis can be used to test the validity of the coherence
and cohesion metrics produced by Coh-Metrix. Suppose that a recall protocol
is collected from a sample of subjects on a text with N elements/constituents.
Each cell in the matrix can then be assigned several types of recall measures. For
example, a recall proportion is the proportion of recall protocols in which both
elements ei and ej are present, with values
varying from 0 to 1. A multiple regression analysis can indicate whether the
[N x (N-1)]/2 recall proportions are predicted by the cohesion matrices, each
entered as a predictor variable (i.e., one for referential, one for causal, etc.).
There is a large number of cells in such analyses; if N = 40, then the number
of cells is 780. Similarly, we can do this for contiguous recalls, for distance
in recall order, for time measures, and so on.
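The data layout for such a regression can be sketched as follows. This is an illustrative outline under stated assumptions, not the analysis pipeline itself: the function names are hypothetical, all matrices are assumed symmetric so that the cells above the diagonal carry one value per unordered pair, and ordinary least squares is solved through the normal equations with a small hand-written solver so that the example needs no external library. In practice a statistics package would be used, and the cells are not independent observations, which a real analysis would need to address.

```python
from itertools import combinations


def upper_triangle(matrix):
    """Flatten the N(N-1)/2 cells above the diagonal, one per unordered pair."""
    n = len(matrix)
    return [matrix[i][j] for i, j in combinations(range(n), 2)]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [b[k]] for k, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; weights are not identifiable")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[k][n] / aug[k][k] for k in range(n)]


def regress_recall(recall, cohesion_matrices):
    """Least-squares fit: recall cell = intercept + one weight per cohesion matrix."""
    y = upper_triangle(recall)
    names = list(cohesion_matrices)
    columns = [[1.0] * len(y)] + [upper_triangle(cohesion_matrices[k]) for k in names]
    # Normal equations: (X'X) beta = X'y, with the columns of X held in `columns`.
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in columns]
    return dict(zip(["intercept"] + names, solve(xtx, xty)))


# With N = 40 elements there are 40 x 39 / 2 = 780 cells per matrix.
print(len(upper_triangle([[0] * 40 for _ in range(40)])))  # prints 780
```

Each row of the implied design table is one pair of elements, the outcome is that pair's recall proportion, and each predictor column is the same cell taken from one cohesion matrix. The same layout would apply to the other recall measures mentioned above by swapping the outcome column.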
This method would give us overall weights for the cohesion relations, which can then be put in a formula to determine readability, comprehension, learning, and appropriateness scores. However, as we mentioned before, different readers might rely on different cohesion relations. Therefore, in addition to these overall scores, we want to capture differences between readers. If the user has information on the reader's knowledge (high vs. low) or reading ability (high vs. low), this information can be entered in the computational model to tailor the formula to particular groups of readers, thus increasing the accuracy of the scores.
In addition to Coh-Metrix, we will develop Coh-GIT to analyze where the relevant cohesion relations are located in the text. This way, writers and educators can not only predict the readability, comprehensibility, learnability, and appropriateness of a text for a particular reader group but also improve the problematic aspects of that text. Zwaan et al. (1995) have reported that reading times for sentences in narrative texts increase robustly with the number of coherence categories that have breaks in continuity (i.e., the current sentence being read is not coherently related to the previous context with respect to a particular coherence category).
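One way to picture how continuity breaks might be located is to read the contiguous cells of each relation matrix and list, for every element after the first, the relation types that have no link to the element before it. The sketch below is a hypothetical illustration, not the Coh-GIT design: the function name and the `threshold` parameter are assumptions, and "previous context" is simplified to the immediately preceding element.

```python
def continuity_breaks(relation_matrices, threshold=0.0):
    """Map each element index i >= 1 to the relation types with a break before it.

    A break is assumed to mean that the cell linking element i-1 to element i
    is at or below `threshold` (hypothetical cut-off, 0 by default).
    """
    names = list(relation_matrices)
    n = len(relation_matrices[names[0]])
    breaks = {}
    for i in range(1, n):
        breaks[i] = [
            name for name in names
            if relation_matrices[name][i - 1][i] <= threshold
        ]
    return breaks


# Toy example with three sentences and two coherence categories (invented values).
matrices = {
    "temporal": [[0, 1, 0], [1, 0, 0], [0, 0, 0]],
    "causal": [[0, 0, 0], [0, 0, 0], [0, 0, 0]],
}
breaks = continuity_breaks(matrices)
for index, categories in breaks.items():
    print(index, len(categories), categories)
```

The number of categories listed for a sentence corresponds to the count that Zwaan et al. (1995) relate to reading time, and the category names indicate which kind of cohesion a writer might repair at that point in the text.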