V1.02
www.househistories.org
Introduction
As you are reading this, historical collections around the world are being scanned and digitised at an ever-increasing rate. The public is demanding easy access to historical
information, and libraries are incentivised to reduce shelf space and manual handling of fragile documents. The mist of history is lifting as digitised content grows, and future
generations will have access to historical information on a scale that we can only dream of today.
As researchers we are left with the challenge of searching the growing amount of data for specific topics and terms. This guide describes how to create a digital
library and search it for terms and phrases using a high-powered file search program. The method uses “brute force” computing, rather than onerous indexing, to scan collections
with almost immediate results. Using this method we will also be able to locate:
▪ Specific document sources, based on the structure of our library
▪ Variations to search terms, to account for imperfect digitisation
▪ Whole phrases and variations to phrases
▪ Multiple terms used in the same context
Etc.
This guide uses a digital library compiled by the www.househistories.org project, which currently contains about 7,200 files with a page count in excess of 1 million.
The collection focusses on the Australian state of Queensland, but it could be any subject. The library could be ten or a hundred times larger – the same method applies.
Create a Library
Once scanned, the images can be converted to PDF format and merged into continuous documents using Adobe Acrobat. The next step is to apply the Acrobat OCR “Text
Recognition” layer, which is also easy. We won’t reproduce the instructions here – just refer to the Acrobat manual.
Check the copyright legislation in your area, to ensure that you do not infringe intellectual property laws when digitising documents.
Whatever your chosen subject, it is well worth spending some time in Google and library sites. You will be surprised what you can find.
Your digital library will, quite simply, consist of a series of folders containing collections of PDF documents.
To keep the library orderly and easy to search, it is important that you create a structure that can expand as your collection grows. We recommend that you use only
one level of folders, divided into “Series” of specific document types. The series in our example library are named as follows:
Each folder is named using the Series letter, number, a description of the documents that it contains and
the date range of those documents.
The key is to have all your folders in one “flat” list, ideally with no “nesting” of folders within folders.
Any nesting will make the library difficult to manage as it grows. You want to be able to “see” all your
folders in one long list.
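The naming convention described above can be checked mechanically. The Python sketch below is illustrative only: `SeriesFolder`, `FOLDER_PATTERN` and `parse_folder_name` are hypothetical names, and the pattern assumes a single capital series letter, a number, a description and an optional date range, as in the example folder names.

```python
import re
from typing import NamedTuple, Optional


class SeriesFolder(NamedTuple):
    series: str                # Series letter, e.g. "A"
    number: int                # Folder number within the series
    description: str           # Free-text description of the contents
    start_year: Optional[int]  # First year of the date range, if present
    end_year: Optional[int]    # Last year of the date range, if present


# Assumed layout: <letter><digits> <description> [<year> - <year>]
# A hyphen or an en dash is accepted between the two years.
FOLDER_PATTERN = re.compile(
    r"^(?P<series>[A-Z])(?P<number>\d+),?\s+"
    r"(?P<description>.+?)"
    r"(?:\s+(?P<start>\d{4})\s*[-–]\s*(?P<end>\d{4}))?$"
)


def parse_folder_name(name: str) -> Optional[SeriesFolder]:
    """Split a folder name into its parts, or return None if it does not fit."""
    match = FOLDER_PATTERN.match(name.strip())
    if match is None:
        return None
    start, end = match.group("start"), match.group("end")
    return SeriesFolder(
        series=match.group("series"),
        number=int(match.group("number")),
        description=match.group("description"),
        start_year=int(start) if start else None,
        end_year=int(end) if end else None,
    )
```

A name that returns `None` does not follow the convention and may be worth renaming before the library grows further.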
The purpose of using “series” is to enable a restricted search to specific types of documents. In the
example library, the “A” series contains electoral rolls and directories, and is used primarily to search for
people and locations. The “E” series contains only materials produced by the Queensland Government. It
is likely that a search will cover only one of these two series, hence the search will be more targeted if we
can limit the number of series to include.
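Because every folder name starts with its series letter, restricting a search to certain series amounts to filtering a flat folder list by first character. The following is a minimal Python sketch of that idea, assuming a one-level library; the function names are hypothetical and are not part of any search program.

```python
from pathlib import Path
from typing import Iterable, List


def folders_for_series(library_root: Path, wanted: Iterable[str]) -> List[Path]:
    """Return the top-level folders whose names start with a wanted series letter."""
    wanted_letters = {letter.upper() for letter in wanted}
    return sorted(
        entry
        for entry in library_root.iterdir()
        if entry.is_dir() and entry.name[:1].upper() in wanted_letters
    )


def folders_excluding_series(library_root: Path, excluded: Iterable[str]) -> List[Path]:
    """Return all top-level folders except those in the excluded series."""
    excluded_letters = {letter.upper() for letter in excluded}
    return sorted(
        entry
        for entry in library_root.iterdir()
        if entry.is_dir() and entry.name[:1].upper() not in excluded_letters
    )
```

For example, `folders_excluding_series(root, ["A"])` would list every folder except the electoral rolls and directories. This only works cleanly because the library is flat; nested folders would need a recursive walk.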
In practice, the library will end up looking something like the extract to the right.
The image to the right shows part of the folder “A002 QLD Post Office and Phone
Directories 1868 – 1959”. As you can see, each of the PDF documents is named and
dated.
The excerpt below is from the folder “B001, Books, Pamphlets”. This folder contains
about 1,100 digitised books and other texts relevant to Queensland history that have
been sourced from repositories all over the world, and some books that we have
digitised ourselves. They are all named with the title, author and year.
You can use any document naming convention that makes sense to you; it
won’t affect the search.
Basic Search
Set up FileLocator Pro
Purchase FileLocator Pro using the link on Page 2, and install the program.
When you open the program for the first time you will see the view below. Follow the two simple steps for initial set-up.
1 This drop-down box should be set to “Expert”, to allow you to access some of the more advanced tools
2 Click Window > Contents View > Dock Below
Step 1 – Choose folders, enter the search term and start the search
To illustrate the process we will search for the term “bullroarer”, an Aboriginal ceremonial instrument.
1 Click the folder selection icon. A file explorer window opens. Navigate to your library, hold down the “Ctrl” key and click on all the folders that you want to include in the
search. In this case we have excluded the “A” series – a very large collection which is mainly concerned with people. All other folders have been selected.
2 Click “select folder” to close the window.
3 Enter the term bullroarer in the “Containing text” field. The basic search is not case sensitive.
4 Press “Start”
1 The bottom bar shows the current status of the search and the total number of documents included in the search
2 As the search gets underway, the file view window will list all the documents found to contain your search term
3 The contents view window will show the terms and their context in each of the documents. Make sure that you have selected the “Hits” tab.
As you click “Next”, Acrobat will show you each successive occurrence of the term in the PDF document. The term is highlighted in blue as shown below.
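To make the “brute force” idea concrete, the Python sketch below reads every file in a folder and reports each case-insensitive hit with a little surrounding context. It is only an illustration, not how FileLocator Pro works internally. It assumes that the text has already been exported to plain `.txt` files (an assumption made to keep the example short, since the program itself reads the PDFs directly), and `find_hits` and `search_folder` are hypothetical names.

```python
import re
from pathlib import Path
from typing import Dict, List


def find_hits(text: str, term: str, context: int = 30) -> List[str]:
    """Return each occurrence of term, ignoring case, with surrounding context."""
    hits = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(match.start() - context, 0)
        end = min(match.end() + context, len(text))
        hits.append(text[start:end])
    return hits


def search_folder(folder: Path, term: str) -> Dict[str, List[str]]:
    """Scan every .txt file under folder and collect the files that contain term."""
    results: Dict[str, List[str]] = {}
    for path in sorted(folder.rglob("*.txt")):
        # Assumption: the extracted text is UTF-8; undecodable bytes are skipped.
        text = path.read_text(encoding="utf-8", errors="ignore")
        hits = find_hits(text, term)
        if hits:
            results[path.name] = hits
    return results
```

No index is built: every search reads every file again, which is why the approach is simple to set up but slows down as the library grows.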
Advanced Search
Search for multiple terms
You can search for documents that contain two or more terms, for example australian and bullroarer. The search will return only documents that contain both terms.
1 Type both terms into the “Containing text” field. If desired, you can add a third and fourth term.
2 The contents view window will show all occurrences of the terms in each document, with each term highlighted in different colours.
1 To search for an exact string, type it into the “Containing text” field and put quotation marks on either side of the string. In this case we’re looking for “small bullroarer”.
2 The contents view window will show all occurrences of the string in each document found.
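The difference between a multiple-term search and an exact-string search can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the program's own matching logic: both checks ignore case, the string check treats any run of whitespace (including a line break) as a single space, and the function names are hypothetical.

```python
from typing import Iterable


def contains_all_terms(text: str, terms: Iterable[str]) -> bool:
    """True if every term occurs somewhere in the text, in any order."""
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)


def contains_phrase(text: str, phrase: str) -> bool:
    """True if the words of the phrase occur together, in the given order."""
    # Collapse whitespace so a phrase split across a line break still matches.
    normalised_text = " ".join(text.lower().split())
    normalised_phrase = " ".join(phrase.lower().split())
    return normalised_phrase in normalised_text
```

A document reading “The bullroarer was small” satisfies `contains_all_terms` for small and bullroarer, but not `contains_phrase` for “small bullroarer”, which is why the exact-string search returns fewer and more specific hits.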
We will now search for terms that are very similar to the term “bullroarer”.
1 Type LIKE bullroarer into the search field (using capitals for LIKE) and press Start.
2 In the contents view window, we can see several instances of bullroarer which have been incorrectly digitised to “buUroarer”. In other documents we
have scored hits on close matches such as bullarer (a place name), which can be ignored.
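The effect of a similarity search can be approximated with Python's standard `difflib` module. The sketch below is not the algorithm behind the LIKE operator, which is not described here; it simply scores each word against the search term and keeps the close ones. The 0.8 threshold and the name `similar_terms` are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher
from typing import List


def similar_terms(text: str, term: str, threshold: float = 0.8) -> List[str]:
    """Return the words in text whose similarity to term meets the threshold."""
    target = term.lower()
    matches = []
    for word in re.findall(r"[A-Za-z]+", text):
        # ratio() is 1.0 for identical strings and falls towards 0.0 as they diverge.
        score = SequenceMatcher(None, word.lower(), target).ratio()
        if score >= threshold:
            matches.append(word)
    return matches
```

With this threshold, the misread “buUroarer” and the place name “Bullarer” would both be reported alongside the correct spelling, which mirrors the mix of useful hits and ignorable near misses described above.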
Indexed Search
Create the Index
Watching a search in progress can be quite meditative, with the document count rising and hits filling the file view window. But as your library grows, the searches will
take longer, and professional researchers will demand immediate results.
FileLocator Pro offers an indexed search function which is easy to set up and produces instant results. To do this, we must instruct the program to build an index file
for your documents, which can then be searched very quickly.
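The reason an index makes searching fast can be shown with a toy inverted index in Python. This is a conceptual sketch only and says nothing about the actual format of the FileLocator Pro index file; it assumes the document text is already available as strings, and `build_index` and `lookup` are hypothetical names.

```python
import re
from collections import defaultdict
from typing import Dict, Set


def build_index(documents: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every word to the set of document names that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return dict(index)


def lookup(index: Dict[str, Set[str]], term: str) -> Set[str]:
    """Return the documents containing term without re-reading any document."""
    return index.get(term.lower(), set())
```

Building the index reads every document once, which is the slow part. Each later search is a single dictionary lookup, so the cost no longer depends on the size of the library.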
In the next screen, select “Create New Index”. The “Create a New Index” screen appears.
1 You can select any location for the index, or you can leave the default location as-is
2 In the “Index Locations” section, untick the “Standard document locations” and tick “Specific locations”
3 Click the folder icon, navigate to your library and select all folders in the library.
4 Click “select folder” to close the window
During the indexing process you will see the status window below. For the million-page example library, creating a new index takes a couple of hours. Note that the
index can be large in its own right – the example library of 135 GB generates an index of 1.1 GB.
1 When the index is completed, close the Index Manager window.