
Lancaster University | In conversation with Mike Scott

OK. It's a great pleasure this afternoon to be talking to Dr. Mike Scott from Aston University.

Thank you very much.

Now, Mike, of course, you've got a long history in all sorts of areas, including language teaching, but

maybe, principally for the purpose of this discussion, we can talk about your experience in building

corpus search tools. Could you maybe tell me when you started doing that?

It goes right back, yeah--

It does, doesn't it?

It was an accidental thing, really. I fell into this by accident, because I was living in Mexico back in the
1970s--

This is going far back.

We're going a long way back, yes, and I bought one of those programmable calculators. And we lived

near the border with the USA, and I was able to go to the USA and buy a programmable calculator and
then work my way through the manual for it. It was real Mickey Mouse stuff. You kind of typed in things,
and then you could programme a moon landing or something, and out it came.

Was it Texas Instruments?

Texas Instruments, yeah.

I had a Sinclair Enterprise, which was the British version of that, of course.

OK. And I read the manual, which I never normally do. I'm not a manual reader at all by nature. But I
read the manual of this thing; it had diagrams of trains on tracks for learning to programme. The idea
was that you were programming something that would be running along a track, and it could go around

a loop, or it could do all sorts of things and end up making some calculations. And it was just an
incidental purchase.

I'm not a physicist, and I wasn't really using the calculator very much, other than sort of the normal
calculations people do with calculators. But it woke my interest a little bit. And at the university where I worked
in northern Mexico, they asked all the staff in the humanities to take a course in the Statistical Package for

the Social Sciences (SPSS).

And so we had to go to the cafeteria and buy these old-fashioned cards, Hollerith cards, which you
bought in a packet of 100. And then you put one in a kind of typewriter thing. And we were taught to
type out the steps you had to do to run this package on the mainframe computer.

Were they punch cards?

They were punch cards, yeah, that's right. And so you'd type this stuff, and you had to leave 15 spaces
at the beginning and start typing. If you typed on the 14th space, you had to throw the card away and do it
again. So every single typing mistake cost you a card.

And so you kind of typed something, and these were instructions to a big powerful computer which you
never really saw, which told it to do something probably pretty Mickey Mouse. And then, after much
struggle, because I'm not a typist at all, you'd get the thing typed through and the cards with their little

perforations, and you'd submit it. And you'd come back the next day or the next week or whenever it
was and you'd be told, well, it failed on the third step. And you've got to replace the card, anyway, all of

that process.

But it was interesting that the university thought it a good idea for all of us humanities people to kind of

learn how to use computers. This is back in the 1970s. This is before computers became available

really for ordinary people.

Then very soon after that, computers did become available. And you could see these ones,

Commodore PETs and things with sort of a ping pong game with a white dot moving across the screen
and batting left and right and all that. I didn't have one. But when I went to England in 1982, I bought a

computer.

I bought one of those first computers. And it was Tim Johns who told me what to buy. He said OK, get
yourself a NewBrain. It was called NewBrain, the computer, and it was a very good one.

But there was no manual other than the one that came with the thing. There were no books. There

was no software. There was nothing to go with it.

Just the headaches with NewBrain.

I bought a computer and the book and a printer. I took it off to Brazil, and where I was living in Brazil,
there was nobody to learn from. It was either, oh dear, expensive

mistake, put it in the cupboard--

Or do something about it.

Or play around with it and try to work out what it could do. And there were some sample programmes,

and there were probably about 150 pages of manual with it. And it was rather a well-produced one, actually.

And I worked my way through the book.

And there were all these very mysterious things I didn't understand, quite technical things about poking

and peeking and God knows what, and they had me perplexed. And my son was at school, and he
came back in the holidays, and he said, in our school we've got computers. And when you type, when

you get to the end of the line, the thing goes back to the beginning of the next line. So you can type a

letter and you don't have to sort of--

No carriage return.

Yes, that's right. It just shoots right to the left. And I was trying to write a word processor using this really
very, very primitive computer. I taught myself to find a way to get the thing--

I thought, oh, the machine won't be quick enough to go around. How can it possibly manage that? And

the only way of doing it was by using these arcane instructions, so it was a bit like the early calculator
stuff.

But anyway, one thing led to another, and Tim Johns was the chap who told me about buying the

computer in the first place, and I struggled with it. And then Tim Johns came out to Brazil again a few
years later, and by that time, I'd written programmes, but all this just merely as a hobby, like somebody

might be interested in crossword puzzles or Sudokus or skiing or whatever else. I was a sailor, but I
also enjoyed doing this thing.

And Tim Johns came by, and he said, well, I've done a programme. And I'd like you to-- I've given it to

the university in São Paulo and some other universities, and I wonder if you could give support on it for

me. So he showed me how it worked. And I didn't know at the time that giving support to somebody
else's programme is actually quite a difficult thing to do.

The programme has to, A, work, and B, when it occasionally needs some help or fails, you have to
have a knowledgeable person who can say what to do.

Yes, and it has to survive the encounter with the user who often does things that you'd never imagine

they could do. Interestingly, there are so many parallels between your experiences and my own. For
me, it was the Sinclair QL that I started programming with.

Quantum Leap.

Quantum Leap it was called, with its SuperBASIC, and I started to fiddle around in the last year of my
Linguistics degree. You got the sense-- by the sound of it you had the same-- you got the sense that
you could do something with language on this--

Exactly, yeah.

That you hadn't been able to do previously. Apart from anything else, typing on it was much nicer than
using a manual typewriter, which used to hurt my fingers.

Well, it came down to this: have you made a mistake? On a computer, you could go back and correct the

mistake. On a piece of paper, on a printer or a typewriter, the mistake would come out.

I remember using a word processor for the first time. It was a package called Quill. And you could buy a
spell-checker to put into it. And all of a sudden, you started to see your spelling errors and the

corrections. But you also started to see potentially that you could start to examine this in a little more
detail.

And I remember doing a sort of very primitive version of corpus annotation for one of my

assignments in my third year, looking at cohesive ties, taking Halliday and Hasan and applying it to a
text and actually putting it all in using this mark-up system within the word processor. And you suddenly
thought, you know you can suddenly do a lot more with these things in terms of studying language than
you could have done previously. Anyway, we're in the jungle in Brazil, and you're helping Tim Johns.

Not quite jungle, but yes, that is right. So I had written a word processor. And then I couldn't really give

adequate support to Tim Johns' programme, which he'd already got. Oxford University Press had
already agreed to publish.

Oh, was this WordCruncher then?

No, it was MicroConcord.

MicroConcord, yes.

It was an early version of MicroConcord. So Tim went back to Birmingham, and the programme was
supposed to be produced by OUP, but it kept on not coming out. It was always in the catalogue, but not
actually coming out.

And it worked. It worked pretty well. But there were some things that it didn't quite do right. I decided to
design my own concordancing software to do the same job. But not in any way thinking that this would
be a rival to it, but just almost like a--

A hobby or an experiment?

A hobby, yes, that's right, so an experiment, yeah. So I did this, and I sent Tim a copy. And Tim said,
oh, good, what we should do now is work together and incorporate the best of both really.

That's so generous of Tim.

Yes, that's right, very generous. And we did. We produced the thing together. Both of our names were
on the product.

And eventually in '93, that came out as a piece of software which Oxford University Press produced.

And it was all out there and so on. It had a concordancer in it.

Yeah, I remember it coming out, because at the time, I had just started teaching. I was lecturing using
the Longman Mini Concordancer, which was good, but it had huge drawbacks. I think you could get a

maximum of about 20 to 30,000 words--

A small amount of text, yeah.

I heard about this new programme, and I thought, well, that sounds better.

And in those days, of course, we're talking about computer chips that were slow, and memory was
very expensive. And computers didn't have any power really. I mean, of course, it's still there on
satellites and technology that's out there in space, '70s and '80s stuff, with very low power.

I think the shuttle ran on that.

That's what I've got inside my phone. Yes, indeed, but of course, luckily the technology has kind of leapt
forwards all the time. And so, back then, it was incredibly difficult: you had to be very careful not to waste too
much space and write the programme as efficiently as you could and keep everything compact. And of

course, now that doesn't hold anymore, and it's possible to bring in many more possibilities into the
software.

So what did you move onto after that?

That grew, because it was a concordancer, and it didn't have a word listing facility, which I think
Longman Mini Concordancer did have.

It did, yeah.

And there were some obvious directions that you should take. And at that time, I had a student called
Fadula, I think, from Malaysia. And she was interested in studying the vocabulary of her textbooks.

And what she wanted to know was, is the vocabulary which appears in unit two actually really repeated

again in unit five and six or 22 or something? Does the poor learner get a chance to practise his
vocabulary again? Is there a systematic way in which vocabulary once presented gets reused? And the
obvious thing to do was to have a word list of unit one and a word list of unit two and so on down the
line.

Compare and contrast.

And so you needed a word listing programme. So I said, OK, I'll write you a word listing programme. It
wasn't anything for publication. It was just going to be a programme for Fadula to use for her master's

work.

That eventually kind of built up impetus, and it turned out not to be too difficult to write a word listing
programme which would know which were the most frequent words and then could show them in
alphabetical order. And so there was a need for that, and that matched in nicely with the concordancing

possibility.
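
As an aside for the reader: the kind of word lister described here is easy to sketch. The following is a minimal illustration in Python (the editor's choice of language, not what any of this software was written in), with a hypothetical filename. It counts word frequencies and shows the list both by frequency and alphabetically.

```python
from collections import Counter
import re

def word_list(text):
    # Case-folded, language-neutral tokenising: any run of word characters.
    return Counter(re.findall(r"\w+", text.lower()))

with open("unit1.txt", encoding="utf-8") as f:  # hypothetical coursebook unit
    freq = word_list(f.read())

# The most frequent words first...
for word, count in freq.most_common(20):
    print(f"{count:6d}  {word}")

# ...and the same list in alphabetical order.
for word in sorted(freq):
    print(f"{freq[word]:6d}  {word}")
```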

And by '96, that had become WordSmith Tools. So I launched the software as version one of

WordSmith Tools with Oxford University Press again, in 1996. And it had three basic functions. It had a
concordancer, which derived very much from the work I'd done with MicroConcord, plus a word lister,

plus a key words analysis tool which was something that I'd again discovered by mistake or by accident.

It was there. It's a bit of a funny story. I'd agreed to give a talk at the university the way all staff were

expected to. So it was my turn to give the talk on Tuesday.

And on Saturday morning, I started thinking, what am I going to say? And I'd got this word listing

programme. What I thought was, it would be quite interesting if I tried to take a word list and then relate
it to another word list.

And I'd got a little word list, and it was the racehorses from the newspaper. It was to do with horse
racing and which horses had won races. And I'd got other word lists that I'd made, including a whole

collection of stuff from The Guardian newspaper-- my brother worked for The Guardian--

That's good.

That helped me to get sources of electronic text. And, as an experiment on that Saturday morning, I

compared a little word list of the one text I had with the collection I'd got. And suddenly, I found out that
some words were leaping out at me, and there was a comparison to be made.

And that was how the key words-- it was using the chi-squared statistical calculation, which is a very
simple and easy one to understand. And it made the comparison word by word for all the words of the

little text and said, is this word somehow outstanding by comparison to the collection I'd got?

The genesis of it all.

And that's how it started. And so then, by Saturday afternoon, I realised that I was on to something

interesting. But I was getting perplexed, because what was happening was words like sunshine or star

were coming out as key or coming out as prominent in this list, because they were names of
racehorses.

Now, the text wasn't about stars or sunshine at all. It wasn't the weather. But I realised that there'd have

to be certain thresholds, so a word had to come up more than once, for example. And this idea of what

would count as being outstanding was a sort of problem.
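
As an aside: the comparison described here is simple to sketch. Below is a minimal illustration in Python of a word-by-word chi-squared comparison of a small word list against a reference collection, with a minimum-frequency threshold of the kind just mentioned. The threshold and critical value are illustrative assumptions, not WordSmith's actual settings.

```python
def chi_squared(a, b, c, d):
    # Plain 2x2 chi-squared: a, b = the word's count in the small text and in
    # the reference; c, d = all the remaining tokens in each.
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / den if den else 0.0

def key_words(small, reference, min_freq=2, critical=3.84):
    # `small` and `reference` map word -> frequency (e.g. collections.Counter).
    small_total = sum(small.values())
    ref_total = sum(reference.values())
    keys = []
    for word, a in small.items():
        if a < min_freq:  # threshold: the word has to occur more than once
            continue
        b = reference.get(word, 0)
        score = chi_squared(a, b, small_total - a, ref_total - b)
        # Keep only words over-represented in the small text, roughly p < 0.05.
        if score > critical and a / small_total > b / ref_total:
            keys.append((score, word))
    return sorted(keys, reverse=True)
```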

That's a lovely little vignette from history, the birth of key words. Of course, many of the issues you
outline are still issues that are with us today. What do you count?

How frequent is frequent enough? What measure do you use to decide whether something's worth
looking at and is indeed key? So it's interesting, actually, how often when an idea is born, it's born with

all of its problems.

And the problems go on, the problems don't get solved effectively.

Well, I certainly remember WordSmith coming out, because I think we had you come up here to talk

about it quite soon afterwards, because we immediately adopted it into teaching. And as I think you

know, it's programmes exactly like that which actually helped popularise corpus linguistics,
because prior to that, it was very difficult to do a lot of the things that you wanted to do. So certainly in

the early '90s to mid-'90s here, when we wanted to do things like build word lists, we would write a
programme to do it.

Exactly.

Put it on the mainframe and get the answer out. Then all of a sudden, with word lists and key words--
and especially WordSmith-- colleagues, maybe even like Norman Fairclough here, certainly started to
think, actually, this is within reach of mortals. I could get someone to do this, or I could do it, too.

Because not everybody knows how to write a computer programme--

Or find and tame a computer programmer, but you were the sort of publicly accessible computer

programmer.

Well, always my model of this was this student of mine, Fadula, who was a master's level student. So

she was wanting to get on with her work, but really she was a language teacher; she wasn't really into
computer science in any way. She just needed word lists to be able to get on with what she needed to

be able to do.
And the model of somebody going to a laboratory, where there were white-coated technicians and air-

conditioned machines, as was still the case back in those days, that didn't suit. I mean, for people in a high-
powered situation, you'd have very specialised computers and very specialised resources. And my

idea was always for the master's level student, the PhD student, for somebody who wants to have a

crack at it but doesn't want to go into a big lab and sign up--

And turn themselves into a computer scientist overnight.

Exactly, yeah.

Well, I wonder whether one issue you might have encountered, because it was an issue I encountered,

was the issue of scale. Some solutions worked when you were dealing with smaller corpora, but as larger
corpora came along, suddenly that algorithm, or the way in which you derived the word list, wasn't efficient

enough anymore. Did you experience any of that? I know I did.

In the earliest days, of course, you were dealing with very small amounts of text. The size of corpora

started to grow, and impressions that people would form on the basis of 20,000 words of text which at
some point way back in the 1990s was considered--

Enough.

A corpus for something. I mean, in a way, there's the whole issue of specialised corpora, isn't there,
where there's never going to be any more Sumerian text produced, because it's a dead

language. So that whole issue is quite a complicated one, and it has grown with time. And now of

course, we've got--

Billions.

Corpora with billions of words, yes. WordSmith isn't really designed for such huge things. It's

designed for something more in the millions, in the tens or hundreds of millions, that's all.

But it's interesting that even over the last 15 years, say, you would have had to change some things,

I dare say, because initially you might have been dealing with people who might have 40,000 to a
million words, and now routinely, just by grabbing newspaper text they might come at you with 10

million words.
What has caught me more in that time-- there has been this scale change, and that has been

important, but what has been more important to me is the multi-word nature of language. So I think that
back in the 1990s, we were thinking, well, I was thinking anyway, very much that a word was a

word. And I knew about lexico-grammar.

A word wasn't necessarily either a sheep or a goat, either a function word or a lexical word.

It could play both roles at once. And I had some understanding of that.

But I think very much the focus was on individual words, like mug or table or graceful or six or because.

These are words that we wanted to understand and find how they worked. And then we'd look further,
and then you'd find words where it was the company they were keeping.

And, of course, now, collocations have grown out of all of that. And collocation was still very much a

gleam in somebody's eye in a way. Obviously, it goes right back many decades, but at the same time,

back I think in the early '90s, I don't think the power of collocation was as visible as it is now.

I'd certainly agree with you on that.

And then the multi-word unit starting to come out.

It did start to come out.

And all the work with n-grams and skipgrams and concgrams and all these different things.

And it was around that time that Renouf and Sinclair started to do collocational frameworks, which

actually were also a variety of multi-word unit when you look at them, with fixed elements and variable

slot fillers. So yes, there was definitely a lot in the air around collocation.

I think that was all in the air, but it was very difficult. And it still is very difficult to study, because all that

early work-- Sinclair did all the crucial work way back, showing that if you look beyond about four words
to the left and right of the word you're interested in, you will get so many words coming up that you will

lose the patterning that you suspect must be there because of all the noise of millions of kind of

essentially unrelated words coming into the bucket that you're looking at, into the frame. But at the
same time, there must actually be relationships between words that are quite a long way away from

each other.
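
As an aside: the classic span is easy to illustrate. A minimal Python sketch of collocate counting within four tokens either side of the node word follows; the tokenising and the span size are the assumed, illustrative parts.

```python
from collections import Counter
import re

def collocates(text, node, span=4):
    # Count every token within `span` positions left and right of each
    # occurrence of `node` -- the classic 4:4 window.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - span):i])   # left context
            counts.update(tokens[i + 1:i + 1 + span])   # right context
    return counts
```
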
And you can find famous examples of that in the literature especially. So in terms of other challenges

there now, you mentioned the contact with users before. And one of the great things about your
programme is over the years you've been extremely responsive to people writing to you saying, have

you thought of adding this or there's a bug there, would you fix that? Have there been notable

developments generally to do with your software over time by user--

Oh, enormously so, yeah. Because basically back in the 1990s, I kind of imagined that I would know
what people might want, and that was quite wrong. For example, in a word list, I assumed that you

would want the word list to be capitalised.

And why would you want to have a word list which distinguished between 'The' with a capital T and 'the'

without a capital T? Well, somebody who's studying sentence relations might well be wishing to look into

all sorts of things where they would use capitalisation. So I had people writing to me saying, why does
your word list--

Take the capitalisation away?

Take the capitalisation away. Why does it sort of damage the text in a way? Can you please put in a

switch, so that we could have that optional?

And I'd get people writing in who were students of languages other than English. And also a major

preoccupation I had was always, let's see if we can make it possible to allow pretty well any language to

be studied. So that means, don't build into the software any knowledge of English. Although I might be
in a position to find out enough things about English to do that, I wouldn't be in a position to find out

things about Lithuanian and Polish and Farsi, and so there's no way that could be done. So let's keep it

language-neutral.

So provide and don't preclude, I might say.

Provide and don't preclude and allow for when people write in and say, well, I'm actually looking at the

poetry of the Middle Ages in Greece or something, and please can you allow for that fact, because it
has this implication. So they would write to me and say, this is what I'm doing. And I'd say, oh, how

interesting. Tell me more.

And then, sometimes, I would be able to say, yes, I think I can probably accommodate that. And

sometimes I'd have to say, no, I'm afraid that would get it too specialised and I can't do it. But mostly, I
would like to, because it's basically a hobby. The whole thing started out as an accidental thing, a

hobby.

So you're happy to do it.

If people said, can it be done? It's a challenge. And I would like to meet that challenge, and see, well,

I'm bloody well going to try to do it.

Well, I have a memory, and it might be one of those unreliable memories. But I have a memory of when

you came up to talk to us about WordSmith, around that time, I was very interested in the question of
dispersion and the work of Alphonse Juilland, and we had a chat about it. And to my cheery delight,

sometime later, you did these dispersion plots in WordSmith. So I thought that was just wonderful.

Well, it's a wonderful idea, and I wasn't aware of it. And there is this possibility, can it be done? Is there

something in the design of the software which makes that impossible? Sometimes, because of the way you've set
out the whole design of the structure of the machinery, you can't do it. But often, the machinery will permit it,

if you actually think your way through it.

So it's a little problem-solving puzzle-- can it be done? Now, will it be useful? Well, it depends on the

user, doesn't it. And the plotting facility that I put into WordSmith with the key words and the

concordance plotting is, I think, only useful on some occasions to some people. And that's typical of
many of the resources. I expect there are people who never look at the collocates perhaps or--

Who knows?

Who don't find that interesting or challenging or useful for their research.

But on dispersion plots, only last Friday, someone who's quite new to corpus linguistics was looking at
something with me. And they said, oh, but what if all of those examples are just from one text?

The dispersion plot showed the examples nicely dispersed across the corpus, and they were delighted. They were pleased that

this wasn't some type of invalidation.

That's interesting you mention that, because that's another thing that I had to bring in. And I think
probably this millennium, the notion of how many texts is this true of? Because you might say, this is

very interesting. But it's really only in a tiny cluster of texts that we find this phenomenon, and the rest of
the time either it's not there or it's quite different.

It behaves very differently. And so there's an extra column that came in showing how many texts this
occurs in; that came into the word listing tool.
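
As an aside: both ideas reduce to very little code. The sketch below, in Python and purely illustrative of the idea rather than of WordSmith's implementation, computes the "how many texts" column as the number of texts containing the word at all, and the normalised hit positions that a dispersion plot is drawn from.

```python
import re

def tokenise(text):
    return re.findall(r"\w+", text.lower())

def range_count(texts, word):
    # The 'texts' column: in how many of the texts does the word occur at all?
    return sum(1 for t in texts if word in set(tokenise(t)))

def dispersion_positions(text, word):
    # Normalised 0..1 positions of each hit: the raw material of a dispersion plot.
    tokens = tokenise(text)
    return [i / max(1, len(tokens) - 1)
            for i, tok in enumerate(tokens) if tok == word]
```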

That's very handy, very handy. So what do you think the future is for corpus searching tools? If you had

your life over again-- and rest assured, Mike, if I could give you it over again, I would wave my wand,
and you'd have it-- but if you could have your life over again, and you had another 100 years to work on
this topic, where would you be taking it?

I think the one certain thing we'd know is that it's not possible to predict it. That seems to me to be fairly

safe. Because if you go back 100 years, you've got somebody who's just invented the
motor car or some of these early technologies, and they can't really imagine the problems of urban
transport and motorways getting clogged up.

And it's difficult to go back into the head of somebody in 1920, 1913 in this particular case, and have

them imagine what our world is like now. And I think in 100 years time, goodness knows what people
will be doing. But I think I'm very lucky, having been in at the beginning of this kind of corpus
linguistics done on computers that you can hold in your hand or have in a room without having a

technician with you. I think I've been pretty lucky to have got in as a complete amateur, sort of on a
self-taught basis.

I don't think that would be easy to get in at all as an amateur now perhaps.

No, I could see that.

Well, probably one could.

Yes.

But in a hundred years' time, I wonder whether it will all be kind of you will need qualifications to be able
to get in, perhaps.

But I would guess then that your engine for the future would probably be individuals using the data and
coming forward with ideas for things that they think need doing.

Yes, well, it couldn't not be that, yeah. But I also think of all of those dimensions of text which we

aren't looking at at the moment, because we don't really have the means of doing it well: the
accompanying sound, the smell, all the senses, all the visuals if you like, the equations in the
mathematical parts, the captions below pictures and all that sort of stuff. There's lots of stuff which

we're losing, because we're still looking at plain text. I think that's certain to come in, but how will it work,
and how well?

Think how many corpora we will have, like the TED conferences, where you can have the video and the
transcript, which we can already get. That's just beginning. And I think, in the very near

future, we'll be concordancing them. With WordSmith you can do it already, in a simple sort of way. I
could imagine that growing enormously.

Yeah, I see that. Well, I think you're right. It was an enormous privilege to have been around anywhere
near the start of it. Apart from anything else, one of the nice things was everything was novel.

Every time you wrote a little programme to do x, that somebody hadn't written, it was absolutely new.

And you were seeing it for the first time. Whereas nowadays, a lot of those things have become banal in
a technical sense.

They're so easily derived, they're everywhere-- word lists, things like that. It's difficult at times to think

back to the excitement you'd get from seeing a word list for the first time, from a million or 10 million
words or something.

Suddenly realising that half of the ideas that we had about how language works were half-baked or
were based on some sort of supposition, perhaps from some person with a long grey beard. And they

turned out not really to be very true at all. People being so sure.

And I remember as a language teacher, I was so sure. At the age of 25, I was absolutely sure what was
right and what wasn't right. And now you google the thing that I would have said no, that's not English,
and you find lots of examples of it.

I always tell my students, they're not going to get it right. But then, they get it more right than the person

that was in it previously. And that's really what the development of human knowledge is all about,
incremental baby steps towards something that may be right.

But I agree with you, certainly the most scandalous claims about correctness have been made with
great assuredness by some people. But actually the better route to take is one of humility where you

say, this is my best estimate at this. Help me find out where it's wrong, and we'll push it further--

And corpus tools give you a chance-- they don't really give you a chance to prove anything very much.
But they do give you a chance to examine a lot more data. And then feel more confident, and say, well,

the claim that was made earlier is actually not very well-founded, or it needs to be modified in some way
or other.

Oh, and they are a wonderful source of falsification, as well. Certainly in one of my books, I've looked at
certain sentence types that Chomsky said could never occur. You find them, you find them very quickly.

Schwartz [INAUDIBLE], there's one.

And it's perfectly fine. Nobody fell over when that sentence was written or spoken. And it's just, yeah,
they're powerful tools for falsification, as well. Well, it's been great chatting to you, Mike.

Good. Happy to.

And I hope to see you again before too long.

Yeah, thank you.

