Mary Smythe was a woman in her forties who had the mental age of
a 12-year-old. Friendly and outgoing, she was popular with her mother’s
neighbours in the northern town of Moorheed. Unfortunately, her friend-
liness seems to have been her downfall because one day she returned
home to complain that an elderly male neighbour had sexually assaulted
or raped her. Several of her neighbours had indeed seen her going into Joe
Brown’s house, but they would have had difficulty in believing that the
elderly cat-loving Mr Brown would do such a thing. Nevertheless, the mat-
ter was reported to the police and Mr Brown was interviewed. Eventually,
charges were laid against him. Soon afterwards the neighbours who had
reported seeing Mary enter his house on the day of the alleged attack
began to receive malicious and threatening letters. They reported receipt
of these letters to the police. Other individuals, who were not witnesses,
knew of several visits by the victim to the suspect’s house, or knew of
other information which was damaging to the suspect, or perceived by
him to be damaging. These people also received anonymous letters. All
of the anonymous letters contained claims, supposedly made by a ‘friend’
of the suspect, that he, the suspect, did not rape the victim.
At about the same time Joe Brown himself wrote several letters to a
number of the witnesses and others, effectively making the same claims
as the questioned author/s, and in very similar language. Below are some
examples of the language of both text groups, the known (K) and the
questioned (U). There are two known examples (K1 and K2) and two
unknown examples (U1 and U2).
Application of ‘markedness’
Table 19.2. Corpus results for phrases like ‘am and was’

String          Frequency
was and is      60
is and was      2
am and was      1 (two clauses)
was and am      0
Note that the results you get have to be very clearly differentiated to be usable at all: for example, 10,000 vs. 50 is a far more valid contrast than 10,000 vs. 9,000. Where
you cannot use a corpus but have to use the internet, for example, if you
are searching for a function word, try to triangulate your results using
different constructions.
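The contrast described above, that 10,000 vs. 50 is usable where 10,000 vs. 9,000 is not, can be sketched as a simple dominance check. The 10-to-1 threshold below is an illustrative assumption, not a figure from the text.

```python
def strong_contrast(count_a, count_b, ratio_threshold=10):
    """Return True when one variant dominates the other by at least
    ratio_threshold to 1 (an illustrative heuristic, not a formal test)."""
    hi, lo = max(count_a, count_b), min(count_a, count_b)
    if lo == 0:
        # Unattested variant: any sizeable count for the other dominates.
        return hi >= ratio_threshold
    return hi / lo >= ratio_threshold

# The contrasts mentioned in the text:
print(strong_contrast(10_000, 50))     # a usable contrast
print(strong_contrast(10_000, 9_000))  # too close to call
```

The point is that a near-equal split tells us nothing about markedness, however large the counts; only a heavily skewed distribution does.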
One search engine2 gives the following counts for various versions of pronoun + be (present) + be (past), and pronoun + be (past) +
be (present), as shown in Table 19.3. From these measurements we see
that the past–present formulation is much more frequent than the
present–past formulation. There appears to be a strong tendency for
the past tense verb to precede the present tense verb. Even if we can-
not be sure of the precision of search engine frequencies, the search
still seems valid, given that the search has been widened to test other
pronouns than the ones we are interested in. In addition, I suggest
that intuition tells us that the past–present sequence is more typi-
cal than the present–past sequence, and also that if we are project-
ing (‘I thought’, ‘he said’, ‘I believed’ etc.) in the past tense then it
would be more usual to refer to the past or earlier event adjacent to
the present or later event. In both the known and questioned texts,
the kind of disordered sequence I have referred to above occurred
on a number of occasions, including such examples as ‘the proof and
evidence was given to the police’, ‘in the past years and months’ and
so on. In each case the sequence was disordered: thus, ‘proof and evi-
dence’ defies the expectation that ‘evidence’ comes before ‘proof’: we
have to collect the one to establish the other. Similarly, in ‘the past
years and months’ we are in a disordered sequence because ‘months’
go to make up ‘years’. The reader can verify this information using
both internet (Google™) and corpus searches (e.g. BNC lists [‘months
and years’: 30 – ‘years and months’: 1]).
The police officer dealing with the case, stated that he did not
believe me.
. . . no proof of sexual activity having taken place, between Mary
and myself.
I and my partner, have over the past years and months, studied the
case.
The break which occurs here is based on the notion that ‘inter course’
is two words. It is a disordering of the structure of the lexis. For the rele-
vant meaning, the lexis does not store ‘inter course’ but ‘intercourse’.
The lexis probably stores, but for quite a different reason, a collection of
prefixes in English, for example, ‘inter’ as in ‘international’, and of course
it stores ‘course’. Hence two storages, ‘inter’ and ‘course’, are possible, but not in the present context. What we essentially have, I am proposing, is a disorder of the structure of the lexis, seen in both K and U,
other examples of which include: ‘photo copied’, ‘back fired’, ‘other wise’
and ‘some where’. As with the punctuation breaks in the sentence, we
are seeing here a break – a word space – where one is not expected or
licensed by the grammar.
There were also many spelling similarities across K and U, for example, ‘intensions’ and ‘coarse’ instead of ‘course’, as well as a number of identical or near-identical phrases, for example, ‘for a so-called rape’ vs. ‘for the so-called rape’, and so on. However, spelling errors require a different kind of categorization from the other features in terms of markedness, and I will not be dealing with that topic here. The examples I have given are simply a small sample of the similarities I found across the texts.
In each case I prepare a ‘match table’ which lists how the known and
questioned texts match each other with respect to a particular type of
similarity. An example is shown in Table 19.4.
Table 19.4. Sample table of matches across known and unknown texts
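Since only the caption of Table 19.4 appears here, the following is a sketch of what such a match table might contain, drawing on features mentioned in this chapter; the row layout, the pairing of known and questioned forms, and the exact-match column are my assumptions.

```python
# Each row: feature type, known-text form, questioned-text form, exact match?
# The forms are taken from examples quoted in the chapter; which text each
# form occurred in is assumed for illustration.
match_table = [
    ("sequence", "the proof and evidence", "the proof and evidence", True),
    ("lexis",    "inter course",           "inter course",           True),
    ("spelling", "intensions",             "intensions",             True),
    ("phrase",   "for a so-called rape",   "for the so-called rape", False),
]

exact = sum(1 for *_, is_exact in match_table if is_exact)
print(f"{exact} exact matches out of {len(match_table)} rows")
```

Near matches, such as the ‘so-called rape’ pair, are recorded but weighted separately from exact matches.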
Some types of matches are more important than other types. Thus,
exact matches of long clauses or whole sentences would have more value
than, for example, misspelling matches. Document layout and style
would have the least value, even less than orthographic (spelling, apos-
trophe, capitalization) matches. The point about matching character-
istics is that each claimed match needs to be based on the concept of
markedness, by which I mean a significant systematic difference from
the norm. For example, if a candidate and the known author both spell
‘receive’ as ‘recieve’ this may indicate markedness, but it is not likely
to be significant markedness. About 2 per cent of the population spell
‘receive’ incorrectly. It is not that unusual. When we look for marked-
ness, we need to determine whether we are dealing with structures which
are unusual or structures which are unconventional. These are two types
of markedness.
With syntactic markedness we are on surer ground than studying
features at a more superficial level, such as document layout: we know,
as native speakers, that in English the form ‘dog the’ is marked because
it violates a simple, easily observable, grammatical rule. Moving
through the layers, with semantic or idiomatic markedness we are on
less sure ground simply because these are not at the same layer of lin-
guistic organization as syntactic structures (see Chaski 2001: 40). In
the present case, it was my opinion that both known and questioned
writers shared the same tendencies, namely, to disorder chronology,
syntax and idiom. Problems or symptoms of very similar types pervade
both sets of texts. The next question is calculating the similarities and
differences. To do this, and following the work of the noted linguist Carole Chaski,3 I compute the number of marked constructions for each type of phenomenon, syntactic and otherwise. It is also important to list counter-examples. Recall that it is not the linguist’s task
to prove anything. The task is one of discovery and demonstration,
not proof.
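The counting step just described, tallying marked constructions per phenomenon type while also recording counter-examples, might be sketched as follows; the categories and counts below are illustrative, not the figures from the case.

```python
from collections import Counter

# Illustrative observations: (category, is_marked). Counter-examples,
# i.e. conventional forms, are recorded too, as the text recommends.
observations = [
    ("syntactic", True), ("syntactic", True), ("syntactic", False),
    ("semantic", True), ("orthographic", True), ("orthographic", False),
]

marked = Counter(cat for cat, is_marked in observations if is_marked)
counter_examples = Counter(cat for cat, is_marked in observations if not is_marked)

for cat in sorted(marked | counter_examples):
    print(f"{cat}: {marked[cat]} marked, {counter_examples[cat]} counter-examples")
```

Listing the counter-examples alongside the marked forms keeps the exercise one of demonstration rather than advocacy.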
How do forensic linguists judge their results? In the same way any
other scientist does. If you have adequately tested your results and your
method and you are able to quantify them statistically, you can repose a
high level of confidence in them. I am using the word ‘confidence’ here
in a non-technical sense. It is not the same thing as a statistical ‘con-
fidence level’. ‘Confidence level’ in statistics means the error rate you
assign as the decision point for claiming an authorship match or, rather,
a valid authorship comparison. One tool I use to assist in reaching a
judgement is an order-of-importance table as shown in Table 19.5. This
gives hierarchical order to markedness.
What this table says is that if we find two texts, supposedly independ-
ently produced, with a significant number of syntactic matches, then
this is likely to be of greater significance than, say, matches relating
to document layout. Finally, my last step is to apply a scale of judgements. Typically, linguists use a 10-point scale, though some linguists
now favour using more complex statistical calculations, such as Bayesian
likelihood ratios. At the end of the process the linguist offers an opinion
as to whether there is a significant degree of similarity between two sets
of text. In the present case I felt fairly certain of my conclusion and was
able to assign an 8 out of 10 probability to my opinion. It is always better
to be cautious. A scaled probability of 8 is quite high. It would be rare to
apply a 9, and unheard of to apply a 10.
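The Bayesian likelihood ratios mentioned above compare how probable the observed shared features are under same-authorship against different-authorship. A minimal sketch with invented probabilities, and assuming the features are independent (a strong assumption, since they usually are not):

```python
from math import log10, prod

# Hypothetical per-feature probabilities: P(feature shared | same author)
# vs. P(feature shared | different authors). All values are invented.
features = [
    (0.9, 0.05),  # e.g. a rare disordered sequence
    (0.8, 0.10),  # e.g. an unlicensed word space
    (0.7, 0.30),  # e.g. a common misspelling: weak evidence
]

# Under the independence assumption, the per-feature ratios multiply.
lr = prod(p_same / p_diff for p_same, p_diff in features)
print(f"combined LR = {lr:.0f}, log10 LR = {log10(lr):.1f}")
```

Note that a common misspelling contributes little on its own; it is the accumulation of individually rare matches that drives the ratio up, which parallels the order-of-importance reasoning above.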
I handed in my report to the court and in due course the suspect was
confronted with it. After about a month, the suspect appeared in front
of the court continuing to deny the charge of rape but admitting to
sexual assault. As I am sure the reader is aware, in order to protect the
victim, I cannot be any more specific as to the details, but the work that
Notes
1. The British National Corpus, a large body of language comprising 100 million words, compiled by researchers at the University of Oxford.
2. Google, June 2007.
3. Chaski, C. (1998). ‘A Daubert-inspired assessment of current techniques for
language-based author identification’, US National Institute of Justice.