Coh-Metrix & Coh-Git




Coh-Metrix automatically determines how text elements and constituents are connected for specific types of cohesion. Suppose there are (2 x 2 x 5 =) 20 types of cohesion (local and global, vocabulary- and grammar-driven, referential, locational, temporal, causal, and structural). Suppose that there are N elements and constituents in a particular text. There would be N x (N-1) directional cohesion connections and [N x (N-1)]/2 bidirectional connections with respect to any one of the 20 cohesion relations. We can capture the resulting set of connections in the form of a matrix (Graesser, Karnavat et al., 2000; Kintsch, 1998; Zwaan et al., 1995). A cell in the matrix is 0 if there is no relation between a pair of elements/constituents, 1 if there is a solid relation, and an intermediate value if the metric is non-discrete. The full matrix has the entire set of connections with respect to a type of relation R, whereas the contiguous sub-matrix includes only those elements/constituents that are contiguous in the explicit text. We can define a density measure from the summation of such cell values. For example, the causal cohesion density would be computed as:
$$\frac{\sum_{i \neq j} R(e_i, e_j)}{N \times (N-1)}$$

which is the average cell value over all pairs of elements with respect to the causal relation. One could restrict this to a contiguous, bidirectional, causal cohesion density, which considers only the contiguous elements and constituents, computed as:

$$\frac{\sum R(e_i, e_j \mid e_i \text{ and } e_j \text{ are contiguous})}{N-1}$$
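
The two density measures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the matrix is a plain list of lists, the function names `cohesion_density` and `contiguous_density` are our own, the example matrix is invented, and the contiguous version assumes the sum is averaged over the N-1 pairs of adjacent elements.

```python
def cohesion_density(R):
    """Average cell value over all N x (N-1) ordered pairs with i != j."""
    n = len(R)
    if n < 2:
        raise ValueError("need at least two elements")
    total = sum(R[i][j] for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


def contiguous_density(R):
    """Average cell value over the N-1 pairs of adjacent elements.

    Assumes a symmetric (bidirectional) matrix, so only R[i][i+1] is read.
    """
    n = len(R)
    if n < 2:
        raise ValueError("need at least two elements")
    total = sum(R[i][i + 1] for i in range(n - 1))
    return total / (n - 1)


# Hypothetical causal matrix for a four-element text (symmetric).
causal = [
    [0, 1, 0, 0],
    [1, 0, 0.5, 0],
    [0, 0.5, 0, 1],
    [0, 0, 1, 0],
]

print(cohesion_density(causal))    # off-diagonal sum 5 over 12 ordered pairs
print(contiguous_density(causal))  # adjacent sum 2.5 over 3 adjacent pairs
```

In this toy matrix the contiguous density is higher than the full density because all of the causal links happen to fall between neighbouring elements.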

When integrating over all types of cohesion markers, one can compute an overall cohesion density score for a particular text. More importantly, however, the individual density scores would specify the extent to which an entire text carries cohesion markers for a particular type of relation. For instance, it might be the case that different readers (low vs. high knowledge, low vs. high reading ability) rely more on one type of cohesion relation than another.

A fine-grained recall analysis can be used to test the validity of the coherence and cohesion metrics produced by Coh-Metrix. Suppose that a recall protocol is collected from a sample of subjects on a text with N elements/constituents. Each cell in the matrix would have different types of recall measures. For example, a recall proportion is the proportion of observations in which both nodes ei & ej are present in a particular recall protocol, with values varying from 0 to 1. A multiple regression analysis can indicate whether the [N x (N-1)]/2 recall proportions are predicted by each of the cohesion matrices as predictor variables (i.e., one for referential, one for causal, etc.). There is a large number of cells in such analyses; if N = 40, then the number of cells is 780. Similarly, we can do this for contiguous recalls, for distance in recall order, for time measures, and so on.
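
As a rough sketch of the recall proportions described above, the following Python computes, for every unordered pair of elements, the proportion of recall protocols in which both elements appear. The representation of a protocol as a set of element indices, the function name `recall_proportions`, and the sample protocols are assumptions made for illustration; the regression step itself is not shown.

```python
from itertools import combinations


def recall_proportions(protocols, n):
    """Map each pair (i, j), i < j, to the proportion of protocols recalling both.

    protocols: list of sets of element indices (0 .. n-1) that a subject recalled.
    """
    if not protocols:
        raise ValueError("need at least one recall protocol")
    counts = {pair: 0 for pair in combinations(range(n), 2)}
    for recalled in protocols:
        # Ignore indices outside the text, then count every co-recalled pair.
        present = sorted(e for e in set(recalled) if 0 <= e < n)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return {pair: c / len(protocols) for pair, c in counts.items()}


# Hypothetical protocols from four subjects on a four-element text.
protocols = [{0, 1, 2}, {0, 2}, {1, 2, 3}, {0, 1, 2, 3}]
props = recall_proportions(protocols, 4)
print(props[(0, 2)])  # elements 0 and 2 co-occur in 3 of 4 protocols

# With N = 40 there are [N x (N-1)]/2 = 780 cells to predict.
print(len(recall_proportions([set()], 40)))
```

Each resulting proportion would serve as one observation of the dependent variable, with the corresponding cells of the cohesion matrices as predictors.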

This method would give us overall weights for cohesion relations, which can then be put in a formula to determine readability, comprehension, learning, and appropriateness scores. However, as we mentioned before, different readers might rely on different cohesion relations. Therefore, in addition to these overall scores, we want to include differences between readers. If the user has information on the knowledge of the reader (high vs. low) or reading ability (high vs. low), this information can be entered in the computational model to tailor the formula to particular groups of readers, thus increasing the accuracy of the scores.
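
One way such a tailored formula might look is a weighted sum of cohesion densities, with a separate set of weights per reader group. The sketch below is purely illustrative: the group names, the three relations, and every weight in `WEIGHTS` are placeholders with no empirical basis, standing in for values that the regression analyses would have to supply.

```python
# Placeholder weights per reader group (hypothetical, not estimated from data).
WEIGHTS = {
    "overall": {"referential": 0.4, "causal": 0.4, "temporal": 0.2},
    "low_knowledge": {"referential": 0.6, "causal": 0.3, "temporal": 0.1},
    "high_knowledge": {"referential": 0.2, "causal": 0.6, "temporal": 0.2},
}


def weighted_score(densities, group="overall"):
    """Weighted sum of cohesion densities; unknown groups use the overall weights."""
    weights = WEIGHTS.get(group, WEIGHTS["overall"])
    return sum(w * densities.get(rel, 0.0) for rel, w in weights.items())


# Hypothetical density scores for one text.
densities = {"referential": 0.5, "causal": 0.25, "temporal": 1.0}

for group in WEIGHTS:
    print(group, weighted_score(densities, group))
```

The same text receives a different score for each group, which is the sense in which reader information would tailor the formula.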

In addition to Coh-Metrix, we will develop Coh-GIT to analyze where the relevant cohesion relations are located in the text. This way, writers and educators can not only predict the readability, comprehensibility, learnability, and appropriateness of a text for a particular reader group but also improve the problematic aspects of that text. Zwaan et al. (1995) have reported that reading times for sentences in narrative texts increase robustly with the number of coherence categories that have breaks in continuity (i.e., the current sentence being read is not coherently related to the previous context with respect to a particular coherence category).
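
To illustrate how breaks in continuity could be located sentence by sentence, the following sketch counts, for each sentence, the coherence categories in which it fails to connect to the prior context and flags sentences at or above a threshold. The input format (one dictionary of booleans per sentence), the function names, the threshold of 2, and the sample data are all assumptions; this is not a description of how Coh-GIT will be implemented.

```python
def continuity_breaks(sentences):
    """Count, per sentence, the categories whose continuity flag is False.

    sentences: list of dicts mapping a coherence category to True when the
    sentence is continuous with the prior context on that category.
    """
    return [sum(1 for ok in flags.values() if not ok) for flags in sentences]


def flag_sentences(sentences, threshold=2):
    """Return indices of sentences with at least `threshold` breaks."""
    return [i for i, n in enumerate(continuity_breaks(sentences)) if n >= threshold]


# Hypothetical continuity judgments for a three-sentence text.
sentences = [
    {"referential": True, "temporal": True, "causal": True},
    {"referential": True, "temporal": False, "causal": False},
    {"referential": False, "temporal": True, "causal": True},
]

print(continuity_breaks(sentences))  # breaks per sentence
print(flag_sentences(sentences))     # sentences a writer might revise first
```

Here the second sentence breaks both temporal and causal continuity, so it is the one flagged for revision.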