
Digitised History

Build a million-page library and search it instantly, with high-powered search tools

V1.02
www.househistories.org
Introduction Create a Library Basic Search Advanced Search Indexed Search

Introduction
As you are reading this, historical collections around the world are being scanned and digitised at an ever-increasing rate. The public is demanding easy access to historical
information, and libraries are incentivised to reduce shelf space and manual handling of fragile documents. The mist of history is lifting as digitised content grows, and future
generations will have access to historical information on a scale that we can only dream of today.

As researchers we are left with the challenge of searching the growing amount of data for specific topics and terms. In this instruction we will describe how to create a digital
library and search for terms and phrases using a high-powered file search program. The method uses “brute force” computing, rather than onerous indexing, to scan collections
with almost immediate results. Using this method we will also be able to locate:
▪ Specific document sources, based on the structure of our library
▪ Variations to search terms, to account for imperfect digitisation
▪ Whole phrases and variations to phrases
▪ Multiple terms used in the same context
Etc.

This instruction will use a digital library compiled by the www.househistories.org project, which currently contains about 7,200 files with a page count in excess of 1 million.
The collection focusses on the Australian state of Queensland, but it could be any subject. The library could be ten or a hundred times larger and the same method would apply.

Please note that:


1) The documents need to be typeset or machine typed. Handwritten text cannot (yet) be reliably converted to searchable text.
2) For this instruction we assume that your library contains Adobe Acrobat (PDF) files. However, the software can handle almost all file types, including .doc, .epub etc. Your
library can comprise just about any file type containing text.
3) PDF documents must be OCR (optical character recognition) converted. Many PDF documents already have OCR, and the ones that don’t can easily be converted.

Your digital library will require:


1) If using PDF documents, a subscription to Adobe Acrobat, available at www.adobe.com
2) A license to use the “FileLocator Pro” software, available at https://www.mythicsoft.com/filelocatorpro
3) A standard-specification PC with Windows
4) Sufficient hard drive space, internal or external, to store the library
5) A hard drive or “cloud” drive, to back up your collection

Digital Historian www.househistories.org 2/22



Create a Digital Library


Scan and digitise books and other physical documents
You can digitise your own hard-copy documents. The process involves scanning and conversion of the documents to PDF or other formats. Options include:
▪ Resourceful organisations may choose to purchase a purpose-built, professional book scanner
▪ Crafty DIYers may decide to build their own book scanner, using instructions from sources such as this: www.diybookscanner.org
▪ Books and documents may be submitted to professional scanning services
▪ Documents can be scanned using a simple flat-bed scanner, although this may be slow and difficult for bound books
▪ Pages may be photographed using a high-resolution camera

Once scanned, the images can be converted to PDF format and merged into continuous documents using Adobe Acrobat. The next step is to apply the Acrobat OCR “Text
Recognition” layer, which is also easy. We won’t reproduce the instructions here – just refer to the Acrobat manual.

Check the copyright legislation in your area, to ensure that you do not infringe intellectual property laws when digitising documents.

Find digitised documents


There are many sources out there that you can search for your particular topics, for example:
▪ Your national, state and council libraries and archives, which may offer you the option to search specifically for digitised documents
▪ Government departments, institutions and agencies, that often have collections of digitised historic materials (although they may be well hidden in their websites)
▪ www.archives.org, containing a huge and growing range of digitised collections from around the world
▪ Internet searches using Google or PDF-specific internet search engines, which can find digitised books and documents of any topic in the most far-flung and surprising
locations.
▪ Commercial retailers of digitised documents, for example genealogy specialists

Whatever your chosen subject, it is well worth spending some time in Google and library sites. You will be surprised what you can find.

Again - be sure to check the copyright legislation in your area.


Organise your library

Your digital library will, quite simply, consist of a series of folders containing collections of PDF documents.

In order to maintain order and facilitate searches, it is important that you create a structure that can expand as your collection grows. We recommend that you use only
one level of folders, divided into “Series” of specific document types. The series in our example library are named as follows:

A Series - People - Electoral rolls, post office directories, telephone directories
B Series - Books, brochures, pamphlets
C Series - Almanacks, commercial directories
D Series - Brisbane City Council
E Series - Queensland/NSW Government
F Series - Journals, periodicals
G Series - Other

Each folder is named using the Series letter, number, a description of the documents that it contains and
the date range of those documents.
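The naming convention can be checked mechanically. The sketch below is only an illustration: it assumes folder names of the form seen in this guide (a series letter, a three-digit number, an optional comma, a description and an optional year range), and the pattern and function names are our own, not part of any tool.

```python
import re
from typing import NamedTuple, Optional


class FolderName(NamedTuple):
    series: str
    number: int
    description: str
    start_year: Optional[int]
    end_year: Optional[int]


# Assumed pattern: letter + three digits, optional comma, description,
# then an optional "YYYY - YYYY" range (hyphen or en dash).
_FOLDER_RE = re.compile(
    r"^(?P<series>[A-Z])(?P<number>\d{3}),?\s+"
    r"(?P<description>.+?)"
    r"(?:\s+(?P<start>\d{4})\s*[-–]\s*(?P<end>\d{4}))?$"
)


def parse_folder_name(name: str) -> Optional[FolderName]:
    """Split a folder name into its parts, or return None if it does not fit."""
    match = _FOLDER_RE.match(name.strip())
    if match is None:
        return None
    start, end = match.group("start"), match.group("end")
    return FolderName(
        series=match.group("series"),
        number=int(match.group("number")),
        description=match.group("description"),
        start_year=int(start) if start else None,
        end_year=int(end) if end else None,
    )
```

A folder whose name returns None does not follow the convention and may be worth renaming before the library grows further.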

The key is to have all your folders in one “flat” list, ideally with no “nesting” of folders within folders.
Any nesting will make the library difficult to manage as it grows. You want to be able to “see” all your
folders in one long list.

The purpose of using “series” is to enable a restricted search to specific types of documents. In the
example library, the “A” series contains electoral rolls and directories, and is used primarily to search for
people and locations. The “E” series contains only materials produced by the Queensland Government. It
is likely that a search will cover only one of these two series, hence the search will be more targeted if we
can limit the number of series to include.
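Because the folder list is flat and every name starts with its series letter, restricting a search to certain series is just a filter on that first letter. A minimal sketch, assuming the folder names have already been read into a list (for example with os.listdir); the function name and the third folder name in the usage note are hypothetical.

```python
from typing import Iterable, List


def select_series(folder_names: Iterable[str], series: Iterable[str]) -> List[str]:
    """Keep only the folders whose name starts with one of the given series letters."""
    wanted = {letter.upper() for letter in series}
    return [name for name in folder_names if name[:1].upper() in wanted]
```

For example, select_series(["A002 QLD Post Office and Phone Directories 1868 – 1959", "B001, Books, Pamphlets", "E001 Government Gazettes"], "BE") would keep the second and third folders and leave out the large “A” series.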

In practice, the library will end up looking something like the extract to the right.


Inside the folders

The image to the right shows part of the folder “A002 QLD Post Office and Phone
Directories 1868 – 1959”. As you can see, each of the PDF documents is named and
dated.

The excerpt below is from the folder “B001, Books, Pamphlets”. This folder contains
about 1,100 digitised books and other texts relevant to Queensland History that have
been sourced from repositories all over the world, and some books that we have
digitised ourselves. They are all named with the title, author and year.

You can come up with any document naming convention that makes sense to you - it
won’t impact the search.


Basic Search
Set up FileLocator Pro
Purchase FileLocator Pro using the link in the Introduction, and install the program.
When you open the program for the first time you will see the view below. Follow the two simple steps for initial set-up.
1 This drop-down box should be set to “Expert”, to allow you to access some of the more advanced tools
2 Click Window > Contents View > Dock Below


Step 1 – Choose folders, enter the search term and start the search
To illustrate the process we will search for the term “bullroarer”, an Aboriginal ceremonial instrument.

1 Click the folder selection icon. A file explorer window opens. Navigate to your library, hold down the “Ctrl” key and click on all the folders that you want to include in the
search. In this case we have excluded the “A” series, a very large collection which is mainly concerned with people. All other folders have been selected.
2 Click “select folder” to close the window.
3 Enter the term bullroarer in the “Containing Text” field. The basic search is not case sensitive.
4 Press “Start”
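To make the idea of a “brute force” search concrete, the sketch below scans every document for a term, ignoring case, and collects each hit with some surrounding text. It is an illustration only and not how FileLocator Pro works internally: it assumes the text of each document has already been extracted into a Python dictionary, and the function name and context width are our own choices.

```python
import re
from typing import Dict, List, Tuple


def find_hits(documents: Dict[str, str], term: str, context: int = 40) -> List[Tuple[str, str]]:
    """Return (document name, snippet) for every case-insensitive occurrence of term."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    hits = []
    for name, text in documents.items():
        # Every document is read in full; nothing is indexed in advance.
        for match in pattern.finditer(text):
            start = max(0, match.start() - context)
            end = min(len(text), match.end() + context)
            hits.append((name, text[start:end]))
    return hits
```

Every search reads every document from start to finish, which is why the time taken grows with the size of the selected folders.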


Step 2 – Watch as the “hits” pile up


The speed of the search will depend on your system, the number of files selected and the size of each file. It can easily take several minutes for a large collection.

1 The bottom bar shows the current status of the search and the total number of documents included in the search
2 As the search gets underway, the file view window will list all the documents found to contain your search term
3 The contents view window will show the terms and their context in each of the documents. Make sure that you have selected the “Hits” tab.


Step 3 – Review the search results


When the search has completed you can preview each of the hits. As you select a document in the file view window, the “hits” and the surrounding text are displayed in
the contents window. This will give you an idea of the context of each hit, and whether you want to open the actual document to look closer at it.
In this case we want to view “Bullroarers used by Australian Aborigines, Mathews 1897”.
1 Double-click the document PDF icon.


Step 4 – Search the PDF document


The PDF document opens in Acrobat. To find the term in the document:
1 Click Edit > Find. The “Find” window opens.
2 Type in your search term, in this case bullroarer. The term is not case sensitive.
3 Click “Next”


As you click “Next”, Acrobat will show you each successive occurrence of the term in the PDF document. The term is highlighted in blue as shown below.


Advanced Search
Search for multiple terms
You can search for documents that contain two or more terms, for example australian and bullroarer. The search will return only documents that contain both terms.
1 Type both terms into the “Containing text” field. If desired, you can add a third and fourth term.
2 The contents view window will show all occurrences of the terms in each document, with each term highlighted in different colours.
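The logic of a multiple-term search is a simple “all of these” test per document. A minimal sketch under the same assumption as before, namely that document text is already available as Python strings; the function name is hypothetical and the matching is plain substring matching, which may differ from the program's own rules.

```python
from typing import Dict, Iterable, List


def documents_with_all_terms(documents: Dict[str, str], terms: Iterable[str]) -> List[str]:
    """Return the names of documents that contain every term, ignoring case."""
    wanted = [term.lower() for term in terms]
    matches = []
    for name, text in documents.items():
        lowered = text.lower()
        if all(term in lowered for term in wanted):
            matches.append(name)
    return matches
```

Note that substring matching means a term such as australian would also match “Australians”.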


Search for phrases


FileLocator Pro can be used to search for phrases or “strings” of terms in a particular order.

1 Type the string into the “Containing text” field, and put quotation marks on either side of the string. In this case we’re looking for “small bullroarer”
2 The contents view window will show all occurrences of the string in each document found.
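A phrase search differs from a multiple-term search in that the words must appear together and in order. The sketch below shows one way to express that with a regular expression. Allowing any run of whitespace between the words is our own assumption, made because digitised text often breaks a phrase across lines; we have not confirmed how FileLocator Pro treats line breaks.

```python
import re


def phrase_pattern(phrase: str) -> "re.Pattern[str]":
    """Build a case-insensitive pattern matching the words of phrase in order."""
    words = phrase.split()
    # Any whitespace (spaces, tabs, line breaks) may separate the words.
    return re.compile(r"\s+".join(re.escape(word) for word in words), re.IGNORECASE)
```

With this pattern, phrase_pattern("small bullroarer") matches “a Small bullroarer” and also the phrase split over two lines, but not “bullroarer, small”.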


Search for variations


A drawback of OCR is that characters are sometimes misread, and the errors are carried into the searchable text layer of the PDF. A term that reads
“Australia” on the page may be recorded as “Aastraba” in the text layer.
A related issue in historical research is the tendency for names and terms to change spelling over time. We can overcome both problems by searching for variations of
terms and phrases, which we then review manually to pick out the relevant “hits”. For this we use the “LIKE” command, which needs some initial configuration:
1 First, click on Tools > Configuration. The Configuration window opens
2 Scroll down to “Boolean Expressions” and click on it
3 In the window, move the “LIKE Sensitivity” to “Very Similar”


We will now search for terms that are very similar to the term “bullroarer”.
1 Type LIKE bullroarer into the search field (using capitals for LIKE) and press Start.
2 In the contents view window, we can see several instances of bullroarer which have been incorrectly digitised to “buUroarer”. In other documents we
have scored hits on close matches such as bullarer (a place name), which can be ignored.
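The idea behind a “similar term” search can be illustrated with Python's standard difflib module, which scores how alike two strings are. This is a sketch of the general technique only: we do not know which similarity measure FileLocator Pro uses, and the threshold of 0.8 and the function name are assumptions chosen for the example.

```python
import re
from difflib import SequenceMatcher
from typing import List


def similar_words(text: str, term: str, threshold: float = 0.8) -> List[str]:
    """Return the distinct words in text whose similarity to term meets the threshold."""
    target = term.lower()
    found: List[str] = []
    for word in re.findall(r"[A-Za-z]+", text):
        # ratio() is 1.0 for identical strings and falls towards 0.0 as they diverge.
        score = SequenceMatcher(None, word.lower(), target).ratio()
        if score >= threshold and word not in found:
            found.append(word)
    return found
```

As in the real search, a loose threshold picks up both genuine misreadings such as “buUroarer” and unrelated near-matches such as “Bullarer”, so the results still need a manual review.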


Search for terms in close proximity


In some cases we want to search for multiple terms in a document that are used in the same context, or the same section of text. This is different to the “search for multiple
terms” function, which returns documents containing the terms in any location.
First we must configure the program to look for the terms within a certain proximity. As a default setting, we recommend that you use 1800 characters, which is roughly
one page of a book. You can change the distance as required.

1 Click on Tools > Configuration. The Configuration window opens


2 Scroll down to “Boolean Expressions” and click on it
3 Change the “Max distance between terms (in characters)” to 1800, or your distance of choice


We will now search for bullroarer used in proximity to ceremony.


1 Type bullroarer NEAR ceremony into the search field (using capitals for NEAR) and press Start.
2 In the contents view window we see instances of bullroarer found in proximity to ceremony, with the two terms highlighted in different colours.
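A proximity search can be pictured as finding the positions of both terms and keeping the pairs that lie close enough together. The sketch below measures the distance between the starting positions of the two terms, which is our own simplification; the program may measure the gap differently. The function name is hypothetical, and the default of 1800 characters mirrors the setting recommended above.

```python
import re
from typing import List, Tuple


def near_hits(text: str, first: str, second: str, max_distance: int = 1800) -> List[Tuple[int, int]]:
    """Return (position of first, position of second) pairs within max_distance characters."""
    def positions(term: str) -> List[int]:
        return [m.start() for m in re.finditer(re.escape(term), text, re.IGNORECASE)]

    first_positions = positions(first)
    second_positions = positions(second)
    return [
        (a, b)
        for a in first_positions
        for b in second_positions
        if abs(a - b) <= max_distance
    ]
```

A document containing both terms many pages apart returns an empty list here, whereas the plain multiple-term search would still report it.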


Indexed Search
Create the Index
Watching a search in progress can be quite meditative, with the document count rising and hits filling the file view window. But as your library grows the searches will
take longer, and professional researchers will demand immediate results.
FileLocator Pro offers an indexed search function which is easy to set up and produces instant results. To use it, we must instruct the program to build an index file
for your documents, which can then be searched very quickly.

1 In the drop-down list to the top right, select “Index Search”


2 In the adjacent drop-down, select “Create new index”
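To see why an index makes searching so much faster, consider the simplest form of one: a table that maps each word to the documents containing it. Building the table requires reading everything once, but afterwards a search is a quick lookup. The sketch below illustrates that general idea only; the actual format of the FileLocator Pro index is not described in this guide, and the function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Set


def build_index(documents: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every lower-cased word to the set of document names that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return dict(index)


def lookup(index: Dict[str, Set[str]], *terms: str) -> Set[str]:
    """Return the documents containing all terms, without rereading any document."""
    if not terms:
        return set()
    results = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*results)
```

The slow part, reading every document, happens once when the index is built rather than on every search.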


In the next screen, select “Create New Index”. The “Create a New Index” screen appears.

1 You can select any location for the index, or you can leave the default location as-is
2 In the “Index Locations” section, untick the “Standard document locations” and tick “Specific locations”
3 Click the folder icon, navigate to your library and select all folders in the library.
4 Click “select folder” to close the window


During the indexing process you will see the status window below. For the million-page example library, building a new index takes a couple of hours. Note that the
index can be large in its own right: the example library of 135 GB generates an index of 1.1 GB.
1 When the index is completed, close the Index Manager window.


Search the Index


We can now use the index to search the whole library using the same commands as before: single terms, multiple terms, and the LIKE and NEAR commands.
1 In the drop-down list to the top right, make sure that “Index Search” is selected
2 In this case we are searching for flog NEAR triangle. As you type your terms into the search field the document hits will appear instantly. Note that as
you select the documents, it will take a few seconds for the text to appear in the contents view.


Update the Index


Every time you add or delete documents in your library you’ll need to update the index, or the new documents won’t be included in the search.
1 Click the “Edit Index Settings” icon. The Index Manager window appears.
2 Click the “Update Index” icon. The update process may take several minutes depending on the size of the index.
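An index goes out of date because it is a snapshot of the library at the time it was built. The sketch below shows one common way to work out what has changed: compare the file names and modification times recorded at indexing time with those found now. It is an illustration under our own assumptions, not a description of how FileLocator Pro decides what to re-index, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def index_changes(
    indexed: Dict[str, float], current: Dict[str, float]
) -> Tuple[List[str], List[str], List[str]]:
    """Compare two {file name: modification time} snapshots.

    Returns (added, removed, modified) file names, each sorted alphabetically.
    """
    added = sorted(set(current) - set(indexed))
    removed = sorted(set(indexed) - set(current))
    modified = sorted(
        name for name in set(current) & set(indexed) if current[name] != indexed[name]
    )
    return added, removed, modified
```

If all three lists are empty the index is still current; otherwise an update is due.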



For the current version of this document, go to: www.househistories.org

Questions and feedback: magnus.eriksson.ercons@gmail.com
