
Raymond Nelson – Forensic Psychophysiology

Indianapolis, Indiana / Denver, Colorado


(317) 992-7659 / (917) 293-3208 / (303) 585-0599
raymond.nelson@gmail.com

CONFIDENTIAL – POLYGRAPH EXAMINATION REPORT

April 1, 2019 Examination Number: 20190401Steyn

Barry Taylor <icarus1943@gmail.com>


Forensic Investigator

Ben Lombaard <ben@lietech.co.za>


LieTech Polygraph & Forensic Services
Cape Town, 7530

SUBJECT: Chris Steyn; DOB: 09/22/60


EXAMINER: Raymond Nelson

PREDICATION:

On April 1, 2019, a polygraph examination was administered to Chris Steyn from 11:00AM to 2:00PM
at the offices of LieTech Polygraph & Forensic Services in Western Cape. Referral for examination was
made by forensic investigator Barry Taylor, with Ben Lombaard as an intermediary or local
contact/representative.

The purpose of the examination was to determine the veracity of Ms. Steyn’s statements regarding her
recently published book “Lost Boys of Bird Island” (approximately August 2018), which describes her
journalistic investigation of a series of events during the late 1980s. Ms. Steyn’s co-author Mark Minnie
was formerly employed as a homicide investigator with the police in South Africa and has died since the
book’s 2018 publication. Ms. Steyn’s book includes sourced information that implicates a number of
persons who were public figures at the time of the reported events.

Steyn polygraph report - Page 1 of 10


Referral information indicates that questions have arisen since the publication of her book regarding
whether Ms. Steyn has falsified the reported information in her book, or has falsified any of the
information sources used in the development of the reported information. No legal proceedings have
been made known to this writer in the referral for this examination, and none were made known at the
time of the examination. Referral for this examination is in the context of a private investigation.

Prior to the examination, Ms. Steyn read, verbally reviewed, and signed a written consent/authorization
form explaining the rights of the examinee during polygraph testing, including the right to terminate the
examination at any time. The consent/authorization form explained that standardized procedure
requires that the examination procedure be explained to the examinee, and that all test questions be
reviewed prior to attaching any physiological recording sensors and prior to recording any responses.
Upon signing the consent/authorization form, Ms. Steyn agreed to release this examiner/author,
Raymond Nelson, including any agents, employees, employers, or affiliates, from any liability resulting
from this polygraph examination. Ms. Steyn was informed that the entire examination would be
recorded, and that the examination data and findings would be subsequently released to the referring
professionals – Barry Taylor and Ben Lombaard.

EXAMINER QUALIFICATIONS:

Raymond Nelson is a psychotherapist, behavioral scientist, and polygraph examiner, currently working
as a polygraph researcher, trainer, and field examiner. Mr. Nelson is widely known in the polygraph
profession. In addition to his expertise as a statistician and researcher, he is a National Certified
Counselor (NCC), with expertise in mental health counseling and psychotherapy with convicted sex
offenders, victims of sexual abuse, and incestuous families, in addition to family therapy, attachment
problems and rage-filled children and adolescents, violence, severe neglect, abandonment, trauma
disorders, developmental disorders, neurologically based learning disorders, personality disorders, and
chronic mental illness – having worked in intensive residential, outpatient, and experiential treatment
programs for many years prior to becoming involved with the polygraph profession in 2000. Mr.
Nelson is a past Vice President of the Colorado Association of Polygraph Examiners and past President
of the American Polygraph Association (APA), currently serving as an elected member of the APA
Board of Directors.

Mr. Nelson has expertise in all aspects of psychometric testing, psycho-diagnostics, forensics and
research. As a researcher in the polygraph profession, Mr. Nelson has published upward of 150
scientific studies and instructional articles on all aspects of the polygraph test, including: interviewing,
test target selection, test question formulation, test formats, manual test data analysis, automated
computer algorithms, faking/countermeasures, the psychological and physiological basis for polygraph
testing, scientific testing in general, use of automated analytic models, polygraph validity and
reliability, suitability for polygraph testing, polygraph recording sensors, digital signal processing, and
polygraph instrumentation, and polygraph field practices.

Mr. Nelson has been instrumental in polygraph policy development at the local, state and national level
including authoring and lobbying polygraph policies for sex offender testing, suitability for polygraph
testing, public safety applicant testing, polygraph records management, domestic violence, domestic
fidelity, and polygraph quality control. Mr. Nelson’s other activities include policy development for
psychometric assessment and risk assessment for adults and adolescents who are convicted of sexual
offenses. Mr. Nelson is also the author of a street-screening tool for suicide risk.



Mr. Nelson has expertise in statistics, analytics and data science, and is the developer of the
open-source Objective Scoring System version 3 (OSS-3) computer scoring algorithm – available
in nearly all computerized polygraph systems worldwide. Mr. Nelson is also the developer of the
Empirical Scoring System (ESS) and Empirical Scoring System – Multinomial (ESS-M) – perhaps the
most commonly used worldwide protocols for manually scoring polygraph test data. Mr. Nelson’s
research includes publications on validity and field practice with a variety of commonly used polygraph
test formats for both event-specific diagnostic exams and multiple issue screening polygraphs,
including the U.S. Air Force Modified General Question Technique, the Federal Zone Comparison
Technique, the Directed Lie Screening Test (Test for Espionage and Sabotage), the Federal You-Phase
technique, the Backster You-Phase technique, the Utah Zone Comparison Test, and the Utah Modified
Comparison Test (Raskin Technique).

Mr. Nelson is a Research Specialist with Lafayette Instrument Company, a global leader in developing
and manufacturing Polygraph instrumentation. Mr. Nelson is also the curriculum director for the
International Polygraph Studies Center (IPSC) an APA accredited polygraph training program that
operates in Mexico City and provides training to government agencies and private security agencies in
several countries. Mr. Nelson was the principal investigator for the 2011 Meta-analytic Survey of
Validated Polygraph Techniques published by the American Polygraph Association, and is the author of
a 2015 publication on the scientific basis for polygraph testing.

Mr. Nelson is a frequent trainer and lecturer in polygraph and other topics at state, local, national and
international conferences. Mr. Nelson has taught at schools and conferences in numerous
states in the U.S., including: Colorado, Indiana, California, Oregon, Nevada, Arizona, New Mexico,
Texas, Louisiana, Florida, South Dakota, New Jersey, Massachusetts, Washington, Michigan
and New Hampshire, in addition to providing training at schools and international conferences in
Canada, Ecuador, Mexico, Colombia, Panama, Honduras, Peru, Guatemala, Singapore, Israel, the
Philippines, Thailand, South Africa, Malaysia, Romania, Poland, Spain, Slovakia and other
countries. Mr. Nelson is a member of the American Polygraph Association, the American Association
of Police Polygraphists, the National Polygraph Association, and the International Association of
Professional Polygraphists. Mr. Nelson is also a member of the American Statistical Association and
an associate of the American Academy of Forensic Sciences.

As a practicing field polygraph examiner, Mr. Nelson has conducted approximately 4,000 polygraph
exams on persons convicted of sexual offenses and other violent offenses. Mr. Nelson has testified as
an expert witness on polygraph and psychotherapy matters in a number of court cases, including
municipal, district, appellate, superior and supreme courts, as well as arbitration hearings, military
courts, and administrative law courts. Mr. Nelson is the recipient of several awards, including the
prestigious Cleve Backster Award and Leonard Keeler Award from the American Polygraph
Association, in addition to the prestigious Max Wastl Award and President's Award for Distinguished
Service from the American Association of Police Polygraphists.

POLYGRAPH TESTING:

Polygraph testing – sometimes referred to as “lie-detection” as a term of convenience – is a
standardized, evidence-based test of the margin of uncertainty or level of confidence surrounding a
categorical conclusion of truth-telling or deception regarding the test target issue that is under
investigation. The analytic theory of the polygraph test is that greater changes in physiological activity
are loaded at different types of test stimuli as a function of deception or truth-telling in response to
relevant target stimuli. Suitable or ideal polygraph target issues will involve a behavioral act for which
the examinee can know unequivocally whether one has engaged in that action.

The polygraph test must be completed in a standardized manner, in a context in which the examinee
can adequately attend to and concentrate on the test topic and test stimuli. Completion of a polygraph
test can take upwards of 90 minutes, consisting of a pretest interview, test data acquisition, and test data
analysis phases. Post-test procedures can include other ancillary activities, including additional
discussion, testing and investigation.

Polygraph test data are a combination of physiological proxies that have been shown to correlate
significantly with different types of test stimuli as a function of deception or truth telling in response to
the relevant investigation target stimuli. Polygraph recording sensors include: respiration activity,
cardiovascular activity, and electrodermal activity, in addition to an optional vasomotor activity sensor.
Standards of practice today require the use of a somatic activity sensor, intended to aid in the
identification of attempted test faking.

The psychological basis of responses to polygraph test stimuli can be thought of as involving a
combination of factors including: attention, cognition, emotion, behavioral conditioning and other
resources for self-control during deception. Suitability for polygraph testing requires that an individual
is functioning within reasonably normal limits. Neither theoretically-referenced nor norm-referenced
probabilistic computations can be expected to apply to individuals for whom the basic theory does not
apply, or to those who are functioning outside normal limits for an intended testing population. Most
persons who can work, drive, attend school, live in the community, and function independently are
suitable for polygraph testing.

Polygraph testing does not detect or measure lies or deception per se. Any actual detection of deception
would require an ability to observe deception with deterministic perfection – immune to random
variation and immune to influence from human behavior. Actual measurement of lies or deception
would require a physical phenomenon for deception and a defined unit of measurement. The purpose of
any scientific test is to quantify some phenomenon of interest for which neither perfect deterministic
observation nor direct physical measurement is possible – and for this reason, all scientific test results
are fundamentally probabilistic. Scientific tests are not expected to be infallible, and are expected
only to quantify the margin of uncertainty or level of confidence that can be assigned to a conclusion.

Analysis of polygraph test data, like other test data, involves several processes, including: feature
extraction, numerical transformation and data reduction, calculation of a likelihood function, and the
use of structured rules to parse a categorical test result from the numerical and statistical test data.
Mean accuracy rates of event-specific single-issue polygraphs can range from the high .80s to low-mid
.90s, depending on the test format and other parameters. Multiple-issue screening tests are both
statistically and psychologically more complex than single-issue tests, with mean accuracy in the low
to mid .80s, depending on the test format and other parameters. Like all scientific test results,
polygraph test results are a description of the margin of uncertainty or level of confidence associated
with a categorical conclusion. [Refer to Attachment A for more information about criterion accuracy
estimates for the variety of polygraph techniques in use today.]
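The analysis stages described above (feature extraction, numerical transformation and data reduction, a likelihood function, and structured decision rules) can be sketched in outline. The function names and the toy scoring logic below are illustrative assumptions only, not the actual ESS-M or OSS-3 computations:

```python
# Illustrative sketch of a generic test-data-analysis pipeline. All names
# and thresholds here are invented for illustration; they do not reproduce
# any actual polygraph scoring algorithm.
from math import exp

def extract_features(questions):
    # Feature extraction: reduce raw sensor tracings to per-question measurements.
    return [q["response_magnitude"] for q in questions]

def transform_and_reduce(features):
    # Numerical transformation / data reduction: combine the per-question
    # measurements into a single summary score (here, a simple sum).
    return sum(features)

def likelihood(score):
    # Placeholder likelihood function: map the summary score to a
    # probability-like value between 0 and 1 (logistic transform).
    return 1 / (1 + exp(-score))

def decision_rule(p, alpha=0.05):
    # Structured rule parsing a categorical result from the statistic.
    if p >= 1 - alpha:
        return "no significant reactions"
    if p <= alpha:
        return "significant reactions"
    return "inconclusive"
```

A score strongly favoring truth-telling would pass through these stages to a categorical "no significant reactions" result, while intermediate statistical values would remain inconclusive.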



Standards of Practice of the American Polygraph Association – like those in other areas of forensic
science – now require the calculation of polygraph test results via computer algorithm, in addition to
manual analysis methods, for evidentiary examinations (i.e., those exams for which the test result is
intended for use as a basis of evidence in a courtroom or legal proceeding). Computer algorithms offer
the advantage of automated reliability, whereas manual polygraph test data analysis includes some
aspects that remain visual and subjective. Reliability of a test result is strengthened when the results of
automated computer scoring algorithms concur with those from manual test data analysis. [Refer to
Attachment B for more information about the scientific basis for polygraph testing.]

PRETEST INTERVIEW:

Ms. Steyn reported that she is in good health, and denied any illnesses or injuries at the time of the
examination. Ms. Steyn explained that she is experiencing considerable stress due to controversy
surrounding her recently published book and the recent death of her co-author.

Ms. Steyn reported that she began studying law and then entered the field of journalism at
approximately age 20. Ms. Steyn explained that she worked as a journalist in Pretoria, Cape Town,
Johannesburg and also while residing outside of South Africa. Ms. Steyn reported that she began
working at the Cape Times during 1986 and remained at that publication through the period of time
described in her 2018 book Lost Boys of Bird Island. Ms. Steyn indicated that she left journalism in
1997 and returned for a period of time while her son was in college. Ms. Steyn denied belonging to a
political party. Ms. Steyn explained that she currently resides in Hermanus where she is co-owner of a
bookstore.

Upon inquiry, Ms. Steyn reviewed and summarized the information described in her 2018 book, Lost
Boys of Bird Island. Ms. Steyn explained that her co-author Mark Minnie died in 2018 of apparent
suicide shortly after the publication of the book. Ms. Steyn explained that Mr. Minnie gave no
indication of his impending suicide during his interactions with her.

When asked to discuss the persons described in her 2018 book, Ms. Steyn indicated that the book
recounts the death of a businessman named David Allen, who was arrested on charges of possessing
pornography as well as for allegedly committing sexual offenses with minor persons. Mr. Allen
subsequently died by suicide during the hours before his court appearance on February 25, 1987.
Information in the book indicated that Mr. Allen was the owner of a successful diving salvage business
in addition to a successful guano business that operated on Bird Island. Ms. Steyn explained that she
had previously known Mr. Allen’s brother as a colleague while working as a news reporter at the Rand
Daily Mail – a paper that was subsequently closed.

Ms. Steyn confirmed that reported information suggests that Mr. Allen was last seen by a friend named
Robert Ball. Published information described Mr. Allen as having been found with a gun in hand, and
that the gun was placed on a cement block wall prior to police arriving on the scene. Mr. Allen was
described in Mr. Minnie’s sections of the book as having implicated, at the time of his arrest, three
cabinet ministers in a ring of paedophilic activity. Prior to his arrest Mr. Allen was known to the police
and community as a salvage diver and reserve police Lieutenant.

Information in Ms. Steyn’s 2018 book describes the alleged paedophilic activity as taking place at Bird
Island off the coast of Port Elizabeth, and indicates that Mr. Allen was alleged to have arranged for the



underage male youths to accompany a group of adults on a fishing trip, during which the adults
reportedly engaged in sexual contact with the underage male youths. Ms. Steyn explained that Bird
Island was habitable, including a house used by Mr. Allen and other buildings. Ms. Steyn confirmed
that information reported in the book included reports that at least one underage male youth required
medical care due to injuries sustained from being anally penetrated with a pistol, and that her co-author
Mr. Minnie had visited one male youth in a hospital, reportedly after that youth suffered an anal injury.
Ms. Steyn reported that a source stated that some underage male youths may not have returned from
the island, and indicated that another source made the same claim after publication.

When asked, Ms. Steyn discussed the death of a cabinet minister named John Wiley, who was found
dead at his home in Cape Town on March 29, 1987. Mr. Wiley was one of the cabinet ministers
reportedly implicated as part of a ring of paedophilic activity by David Allen following his arrest
during February 1987. Mr. Wiley’s death was determined to be an act of suicide, and reports indicate
that, at the time of his death, Mr. Wiley’s son entered Mr. Wiley’s locked room through a window
because the spare key could not be located. Ms. Steyn’s 2018 book implicates Mr. Wiley as having
attended the fishing trips to Bird Island during which the implicated adults had engaged in sexual
contact with underage male youths. Ms. Steyn explained that Mr. Wiley’s son is presently serving as a
member of the provincial parliament.

Ms. Steyn then discussed another cabinet minister named Magnus Malan, who has been deceased since
2011, and who was implicated in her 2018 book as a cabinet member who was also described by
sources as involved in the fishing trips on Bird Island. Ms. Steyn explained that Mr. Malan was first
identified (publicly) as involved in this matter in an article published in Playboy Magazine in about
May 1995.

When asked to discuss the third cabinet minister who was un-named in her 2018 book, Ms. Steyn
explained that a former cabinet minister had – subsequent to the publication of her 2018 book –
revealed himself to other media reporters as the un-named third cabinet minister. Ms. Steyn indicated
that her sourced information implicated him as present with Wiley and others on Bird Island during a
fishing trip that included alleged sexual contact between adults and underage male youths, but did not
implicate him as present during the incident when a youth was anally injured.

Upon inquiry, Ms. Steyn reviewed information reported in her 2018 book by Mr. Minnie, indicating
that the Senior Public Prosecutor, John Scott, had stopped the 1987 investigation into the paedophilia
ring. Ms. Steyn explained that Mr. Scott did not respond to email inquiry from Mr. Minnie during the
completion of the 2018 book Lost Boys of Bird Island.

When asked, Ms. Steyn discussed that Mr. Minnie had contact with a Matron (charge nurse) who
worked at the hospital where an underage male youth had allegedly received surgical care following an
injury that was possibly sustained while being anally penetrated by a pistol. Ms. Steyn explained that
her sourced information included two different male youths who were hospitalized due to injuries,
and that sources indicated that one of those male youths may have experienced the discharge of a pistol
in his rectum while on the island. Ms. Steyn explained that Mr. Minnie’s description – of having visited
an injured male youth in a hospital – involved an injury that was not sustained on Bird Island, though
the youth claimed to Mr. Minnie that he had been taken to Bird Island as well. When questioned



further, Ms. Steyn indicated that although there remains confusion about the two boys, the one
interviewed by Mr. Minnie was not the one who suffered the alleged discharge of a pistol in his rectum.

When asked, Ms. Steyn discussed her attempts to contact a surgeon who remained un-named in her
2018 book, and who may have cared for a hospitalized male youth who is implicated as having been
injured on Bird Island. Upon inquiry, Ms. Steyn provided the name of the un-named surgeon to this
examiner, and explained that she published the text of her email message to that un-named surgeon in
her book, along with the reply from the un-named surgeon. Ms. Steyn explained that she requested
information from the un-named surgeon regarding his treatment of a male youth patient with the
described injuries, and further explained that the un-named surgeon replied only that confidentiality
laws prevented him from providing information. Ms. Steyn further indicated that the reply from the
un-named surgeon, although not affirmative, did not include an assertive denial of having cared for such
a patient. Ms. Steyn explained that discussions with the publisher led to withholding the name of that
surgeon from the published text of her 2018 book.

When asked to discuss another information source that was described in her 2018 book, Ms. Steyn
reviewed information regarding “William Hart” – which she indicated was a pseudonym used to
protect the identity and person of the source, who was reportedly a police informant. Ms. Steyn reported
that she had not met this source herself, though her co-author Mr. Minnie had described the source to
her as alcoholic and/or addicted to drugs and as also sexually involved with David Allen.

When asked to discuss an un-named information source that was referred to in her 2018 book as Mr X,
Ms. Steyn provided the name of this source to this examiner, and explained that he provided
information about Mr. Wiley. Ms. Steyn further explained that subsequent discussion with her publisher
led to the inclusion of the information while concealing the identity of the information source.

Ms. Steyn was not asked to discuss the name or identity of a source described only as a well-known
senior police officer, because her 2018 book did not include information that was clearly sourced to
this person.

When asked, Ms. Steyn denied fabricating any of the allegations about the ring of paedophilic activity
described in her 2018 book Lost Boys of Bird Island. When questioned further, Ms. Steyn denied
including any of the allegations in her 2018 book without an actual human source. Ms. Steyn further
denied falsifying any of her reported information sources, and denied falsifying the actual existence of
any of the human information sources for her 2018 book.

TEST FORMULATION AND RESULTS:

This test was conducted under the Standards of Practice established by the American Polygraph
Association. All of the questions for the in-test phase of the examination were completely reviewed
prior to recording, and Ms. Steyn indicated that she completely understood the scope and meaning of
each question. Prior to the recording of Ms. Steyn’s physiological responses to the test questions
pertaining to the issue/s under investigation, an acquaintance-functionality test was completed to ensure
Ms. Steyn’s familiarity with the recording sensors and procedures, and to ensure the polygraph
instrument and recording sensors were functioning as intended.



A four-question diagnostic test (Utah 4-question format, referred to as the “Raskin technique”) was
administered using a Lafayette LX6 polygraph instrument. Consistent with standardized practices, a
known-solution acquaintance/functionality test was conducted prior to recording the actual test data.
The purpose of the acquaintance/functionality test is to familiarize the examinee with the test sensors
and testing procedure and to ensure that the polygraph instrument and recording sensors are functioning
as intended at the time of the examination. The test sensor array consisted of upper and lower
respiration sensors, an electrodermal sensor, a cardio-activity sensor, a vasomotor sensor, and an
activity sensor. This examination was completed using standardized directed-lie comparison questions.
The question sequence was varied for each iteration, consistent with standardized field practices, in
order to reduce the potential influence of position effects not related to the content of the test stimuli
and to discourage Ms. Steyn from attempting to memorize, anticipate or guess at the content of each
subsequent test stimulus or the question sequence.

Visual inspection of the recorded physiological data indicated that test data were somewhat artifacted
though of satisfactory interpretable quality with discernible responses to test stimuli that were
distinguishable from tonic activity and other non-diagnostic/artifact data. Five repetitions of the
reviewed question sequence were recorded.

Ms. Steyn’s examination consisted of the following scored questions:

Question (R5): In your book Lost Boys of Bird Island did you fabricate any of your reported
information sources?
Answer: NO (No Significant Reactions Indicative of Deception)

Question (R6): Did you falsify any of the allegations you wrote about those persons in your book
Lost Boys of Bird Island?
Answer: NO (No Significant Reactions Indicative of Deception)

Question (R8): Regarding your book Lost Boys of Bird Island did you falsify any of the reported
allegations about those persons?
Answer: NO (No Significant Reactions Indicative of Deception)

Question (R9): In your book Lost Boys of Bird Island did you include any of those allegations
without an actual human source?
Answer: NO (No Significant Reactions Indicative of Deception)

ANALYSIS:

Recorded physiological data were evaluated with the Empirical Scoring System – Multinomial (ESS-
M). The ESS-M is an evidence-based, standardized protocol for polygraph test data analysis using a
Bayesian classifier with a multinomial reference distribution. Bayesian analysis treats the parameter of
interest (i.e., deception or truth-telling) as a probability value for which the test/experimental data,
together with the prior probability, are a basis of information to calculate a posterior probability. The
multinomial reference distribution is calculated from the analytic theory of the polygraph test - that
greater changes in physiological activity are loaded at different types of test stimuli as a function of
deception or truth-telling in response to relevant target stimuli. The reference distribution for this exam



describes the probabilities associated with the numerical scores for all possible combinations of all
possible test scores for 3 to 5 presentations of 4 relevant questions using an array of 4 recording
sensors: respiration, electrodermal, cardiovascular, and vasomotor.

These results were calculated using a prior probability of 0.5 for which the prior odds of truth-telling
were 1 to 1. A credible-interval (Bayesian confidence interval) was also calculated for the posterior
odds of truth-telling using the Clopper-Pearson method and a one-tailed alpha = .05. The credible-
interval describes the variability of the analytic result by treating the test statistic (posterior odds) as a
random variable for which the limits of the credible interval can be inferred statistically from the test
data. A test result is statistically significant when the lower limit of the credible interval has exceeded
the prior odds.

A categorical test result was parsed from the probabilistic result using two-stage decision rules. Two-
stage rules are based on an assumption that the criterion variance of the test questions is non-
independent, and make use of both the grand total and subtotal scores to achieve a categorical
classification of the probabilistic test result. The grand total score of +8 equaled or exceeded the
required numerical cut-score (+3). These data produced posterior odds of truth-telling of 3.9 to 1,
for which the posterior probability was 0.79. The Bayes factor for this result is 3.9, indicating that the
posterior information is 3.9 times stronger than the information available prior to testing. The lower
limit of the 1-alpha Bayesian credible interval was 2.5 to 1, which exceeded the prior odds (1 to 1).
This indicates a 95% likelihood that the posterior odds of truth-telling exceed the prior odds. These
analytic results support the conclusion that there were NO SIGNIFICANT REACTIONS INDICATIVE
OF DECEPTION in the loading of recorded changes in physiological activity in response to the
relevant test stimuli during this examination. [Refer to Attachment C for the printed summary of the
ESS-M analysis.]
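The arithmetic connecting the reported values can be verified directly. This is only a check of the summary numbers above (cut-score comparison, odds-to-probability conversion, Bayes factor, and significance check), not a re-implementation of ESS-M:

```python
# Check of the reported ESS-M summary values (not the ESS-M algorithm itself).
# All input values below are taken from the report text.

def odds_to_probability(odds):
    # posterior probability = odds / (1 + odds)
    return odds / (1 + odds)

prior_odds = 1.0       # prior probability 0.5 -> prior odds of 1 to 1
posterior_odds = 3.9   # reported posterior odds of truth-telling
ci_lower = 2.5         # reported lower limit of the credible interval
grand_total, cut_score = 8, 3

bayes_factor = posterior_odds / prior_odds         # 3.9
posterior_p = odds_to_probability(posterior_odds)  # ~0.796 from the rounded odds;
                                                   # reported as 0.79, presumably
                                                   # computed from unrounded odds
passes_cut = grand_total >= cut_score              # True: +8 exceeds +3
significant = ci_lower > prior_odds                # True: lower limit exceeds prior odds
```

All four checks agree with the reported result: the grand total passes the cut-score and the credible-interval lower limit exceeds the prior odds, which is the condition for statistical significance described above.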

Data were subject to additional analysis using the Objective Scoring System – version 3 (OSS-3), an
automated computer algorithm for polygraph test data analysis. Automated analysis with the OSS-3
algorithm involves automated feature extraction, automated numerical transformation and data
reduction, automated calculation of the statistical likelihood function, and automated execution of
structured decision rules used to parse the categorical test result from the numerical and statistical
information.

The OSS-3 software includes a feature that allows a human examiner to mark and remove artifacted data
segments from the statistical analysis. In addition to removing artifacts from analysis, the OSS-3
algorithm includes a function to calculate the statistical likelihood that number and locations of
observed data artifacts conform to a random pattern. When the statistical likelihood is significant (alpha
= .05) the result can be interpreted as supportive of a conclusion that the examinee may have engaged
in faking during testing. Artifacts observed during this examination are not statistically significant for
test faking, and are within the expected or normal range for random data artifacts.
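One simple way such a randomness check could be implemented is a binomial tail test on how often artifacts land on a particular question type, under the assumption that random artifacts are distributed uniformly across question positions. This is an illustrative sketch only; the actual OSS-3 artifact-pattern computation is not documented here and may differ:

```python
# Illustrative sketch of an artifact-randomness check; not the actual
# OSS-3 computation. Assumes random artifacts fall uniformly across
# question positions.
from math import comb

def binomial_tail(k, n, p):
    # P(X >= k) for X ~ Binomial(n, p): probability of observing k or
    # more "hits" by chance alone.
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def artifacts_nonrandom(artifact_hits, total_artifacts, prop_relevant, alpha=0.05):
    # If artifacts cluster on relevant questions more often than chance
    # predicts, flag the pattern as statistically significant
    # (possible test faking).
    return binomial_tail(artifact_hits, total_artifacts, prop_relevant) < alpha

# Example: 3 of 6 artifacts on relevant questions, which make up 40% of
# question positions -- not significant, consistent with random artifacts.
```

A result like the one reported above, where observed artifacts are within the normal range for random data, corresponds to this check returning a non-significant tail probability.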

Data were analyzed using alpha = .05 for deceptive classifications and alpha = .05 for truthful
classifications. After excluding artifacted data segments from analysis, the OSS-3 algorithm produced a
value of p = .033 using a reference distribution of confirmed deceptive field polygraph examinations.
This result was statistically significant; the value can be interpreted as the statistical likelihood that the
observed data were produced by a deceptive person, and a value below alpha therefore supports a
truthful classification. Another analysis was conducted without the
removal of observed data artifacts, using alpha = .05 for deceptive classifications and alpha = .05 for
truthful classifications. This additional analysis produced a value of p = .002 using a reference



distribution of confirmed deceptive field polygraph examinations. The additional analytic result was
statistically significant, and can be interpreted as the statistical likelihood that the observed data were
produced by a deceptive person. These analytic results support a classification of NO SIGNIFICANT
REACTIONS INDICATIVE OF DECEPTION. [Refer to Attachment D for the printed summary of the
OSS-3 computer algorithm analysis, and Attachment E for a summary of the analysis without artifact
marks.]
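The two-alpha decision logic described above can be sketched in a few lines. This is an illustrative reconstruction of the rule as described in this report, not the actual OSS-3 source code; the function name and the truthful-reference p-values used in the example are hypothetical.

```python
def classify(p_vs_deceptive, p_vs_truthful, alpha_truthful=0.05, alpha_deceptive=0.05):
    """Two-alpha decision rule: each p-value is the likelihood that the
    observed scores came from the named reference population."""
    if p_vs_deceptive < alpha_truthful:
        return "NSR"  # No Significant Reactions: data unlikely under the deceptive reference
    if p_vs_truthful < alpha_deceptive:
        return "SR"   # Significant Reactions: data unlikely under the truthful reference
    return "INC"      # neither hypothesis can be rejected: inconclusive

# The two analyses above reported p = .033 and p = .002 against the
# deceptive reference; the truthful-reference values here are made up.
result_artifacts_removed = classify(0.033, 0.60)   # "NSR"
result_artifacts_retained = classify(0.002, 0.70)  # "NSR"
```

The sketch shows why a small, statistically significant p-value against the deceptive reference distribution supports the truthful (NO SIGNIFICANT REACTIONS) classification reported here.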

Based on consideration of these analyses of the observed test data, it is the expert opinion of this writer
that Ms. Steyn was truthful when answering NO to the above-referenced questions about whether she
falsified any of the allegations or information sources in her 2018 book "Lost Boys of Bird Island."

Respectfully,

Raymond Nelson,
Polygraph Examiner

/rn

Attachments [5]:
A. APA (2011) Meta-analytic Survey of Validated Polygraph Techniques
B. Nelson (2015) Scientific Basis for Polygraph Testing
C. Summary of ESS-M analysis
D. Summary of OSS-3 computer algorithm analysis
E. Summary of OSS-3 computer algorithm without artifact marks



Attachment A: Steyn - polygraph report April 2019

Meta-Analytic Survey of Criterion Accuracy of Validated Polygraph Techniques

Report Prepared For

The American Polygraph Association Board of Directors

Nate Gordon, President (2010-2011)

by

The Ad-Hoc Committee on Validated Techniques

Mike Gougler, Committee Chair

Raymond Nelson, Principal Investigator

Mark Handler

Donald Krapohl

Pam Shaw

Leonard Bierman

Executive Summary

In 2007 the American Polygraph Association (APA) adopted a Standard of Practice, effective January 1, 2012, that requires APA members to use validated Psychophysiological Detection of Deception (PDD) examination techniques that meet certain levels of criterion accuracy.1 Those requirements state that event-specific diagnostic examinations used for evidentiary purposes must be conducted with techniques that produce a mean criterion accuracy level of .90 or higher, with an inconclusive rate of .20 or lower. Diagnostic examinations conducted using the paired-testing protocol must produce a mean criterion accuracy level of .86 or higher, with inconclusive rates of .20 or lower. Examinations conducted for investigative purposes must be conducted with techniques that produce a mean criterion accuracy level of .80 or higher, with inconclusive rates of .20 or lower.2 The goal is to eliminate the use of un-standardized, non-validated or experimental techniques in field settings where decisions may affect individual lives, community safety, professional integrity, and national security.

There exists today a confusing array of test question formats that are at once similar and dissimilar, and for which there are also alternatives in the selection of a method for test data analysis. Equally confusing is the abundance of published research, and the meaning and applicability of that research to the techniques used in field settings. The APA Board of Directors assumed responsibility for organizing this information in the form of a systematic review and meta-analysis of the published scientific literature which describes the criterion validity of presently available polygraph techniques. In the course of doing this, it has at times been necessary to define what appear to be obvious concepts. One such concept is that of validation, which, as it applies to PDD exams, is stipulated by the APA Standards of Practice (Section 3.2.10) to refer to the combination of: 1) a test question format that conforms to valid principles for target selection, question construction, and in-test presentation of the test stimuli, and 2) a validated method for test data analysis as it applies to a specified test question format. Although many factors may affect the overall effectiveness of PDD examinations, these two parts are recognized as fundamental to the criterion accuracy of PDD examinations. The accuracy of all tests is contingent upon these two activities: obtaining a sufficient quantity of diagnostic information, and interpreting the information correctly. The two-fold purpose of this meta-analysis was to advise the APA and its membership about which PDD techniques satisfy the standard practice requirements that take effect January 1, 2012, and to answer questions about our present knowledge-base regarding the criterion validity of PDD techniques as they are presently used.

The ad hoc committee to examine the evidence on the criterion accuracy of polygraph techniques was appointed by APA President Nate Gordon during the March 2011 meeting of the Board of Directors. The committee was composed of Past President and Board Director Mike Gougler (committee chair), Past-President and Editor-in-Chief Don Krapohl, President Elect Pam Shaw, Board Director Raymond Nelson, and APA Members Mark Handler and Leonard Bierman.

The committee also took into consideration that there are both financial and proprietary issues attached to the formulation of such a list. The stakeholders represent a diverse group of professionals and interests. The effectiveness of the APA and the

1 Criterion accuracy refers generally to the degree to which a test result corresponds with what the test is designed
to detect. In the field of PDD, criterion accuracy denotes the ability of a combination of testing and scoring
techniques to discriminate between truthful and deceptive examinees, and ranges from 0.00 for no validity to 1.00
for perfect validity. Criterion accuracy is one form of validity, and in some research reports it may be referred to as
decision accuracy, or just accuracy.

2 Near the completion of this study the APA Board of Directors enacted a change in standards, endorsing the use of PDD screening techniques for which research indicates an accuracy rate that is significantly greater than chance.

Polygraph, 2011, 40(4) 196


Ad Hoc Committee on Validated Techniques

credibility of the polygraph profession required the committee to give precedence to the accuracy and integrity of the research review over the financial and personal interests of any individual developer of PDD testing techniques. The committee's default approach was an inclusionary review process in which any stakeholder could submit supportive data and information for consideration. This did not mean that anything submitted would automatically be included or endorsed as valid, but it did mean that all recommendations would be considered.

The committee began its process with a discussion of the merits and strengths of laboratory and field research. Field studies are important to polygraph research as these studies have the advantage of ecological validity3 and are therefore assumed to have increased generalizability. However, the generalizability of field studies is compromised to some unknown extent by the selection process which necessarily depends on the availability of often-incomplete confirmation data. Real world confirmation data are selective, neither random nor representative of all data, and confirmed cases more often may have correct PDD results than do unconfirmed cases. As a result, field studies may overestimate PDD decision accuracy to some degree. While field studies are highly useful for studying correlations, they provide imperfect measures of criterion validity.

Laboratory studies are also important to polygraph research as these studies can more easily control and reduce research and sampling biases. Use of experimental and quasi-experimental research designs, along with random sampling and random assignment, can increase the generalizability and repeatability of research results. Because of their ability to control a greater number of variables, laboratory studies are fundamental to our ability to study questions of causality and construct validity. However, the generalizability of laboratory studies is complicated by the fact that these studies may not represent the broad range of variables thought to influence the results of field examinations. They are therefore presumed to have ecological validity that is weaker, to some unknown degree, than that of field studies.

The committee took the position that both field and laboratory studies have advantages and disadvantages, and that neither type alone would be sufficient to study all of the issues of concern to polygraph researchers. Both types of research are of vital importance to the study and development of knowledge in the polygraph profession. Differences between criterion accuracy of field and laboratory studies were found to be statistically small and insignificant by the 2002 report on the polygraph by the US National Research Council. For the purpose of reviewing the current state of validation of existing polygraph techniques, the committee gave field and laboratory studies equal consideration.

The committee compiled a list of studies that satisfied the qualitative and quantitative requirements for inclusion in the review, and for which there existed two or more satisfactory publications that describe generalizable evidence of criterion validity. The committee formulated recommendations regarding which techniques satisfied the requirements of the pending 2012 APA provisions for criterion accuracy. The committee then completed a meta-analysis that bench-marked the findings against the 2012 APA criterion accuracy standards for evidentiary, paired testing, and investigative examinations. In addition, statistical tests were completed to check for study integrity, and to search for inconsistencies and outlier results.

3 Ecological validity addresses how well the experimental settings, processes, subjects and materials match those in real-life conditions. Though it is not the same as external validity, greater ecological validity may provide more confidence that the findings of the study will generalize to other settings.




Inclusion in the research review required that studies in the meta-analysis be published in Polygraph or other peer-reviewed scientific publications.4 Studies were considered for selection if they were published by an academic degree-granting institution that was accredited by an accrediting agency recognized by the US Department of Education or foreign equivalent. In addition, research publications of studies funded by government agencies were also considered for selection. Edited academic texts, including individual chapters, were considered. However, studies available only in self-published books were excluded. Additional qualitative requirements were that selected studies must have employed a recognizable PDD technique for which a published description exists for the test structure, test question sequence, and administration. Studies must also have used a recognizable test data analysis (TDA, chart interpretation) method and included a description of the features, numerical transformations (score assignments), decision rules and normative data or cutscores. In addition, studies must have used instrumentation and component sensors that reflect common field practices. Selected studies must have used a variety of confirmation criteria including examinee confessions, high quality forensic evidence, or substantiated evidence that the crime was not committed.5 Study results based on samples that were subject to experimental manipulation (e.g., fatigue, intoxication, programmed countermeasure use) were not included.6

Quantitative information requirements included some form of reliability statistics for each technique as evidence of the generalizability of the results. Several types of statistical measurements for PDD test reliability were reported in the published literature, and all were accepted for inclusion.7 The committee also evaluated the reliability, generalizability and representativeness of the sample distributions through multivariate ANOVAs using the deceptive and truthful scores. It was expected that multiple samples drawn from the same underlying population, administered the same PDD technique, and scored with the same TDA method would replicate among the sampling distributions of scores. It was also expected that aggregation of the results of replicated sampling distributions would be more representative and generalizable than the results from any single sampling distribution. In addition to sample size information, a minimum of four statistical values were required for the meta-analysis: test sensitivity and specificity, and the inconclusive rates for guilty and innocent cases.

It was important to the credibility of the committee's findings that studies be accepted or endorsed based on the merits of the included studies. For this reason, access to the published evidence and raw data was made a priority, and the list of validated techniques was constructed according to a systematic review of published research. In addition to a re-analysis of the study data, studies were included when it was possible to calculate a complete dimensional profile of criterion accuracy.

Following the completion of the literature survey, a list of all identified techniques was sent to the school directors or representatives of all APA accredited polygraph schools. School directors or their representatives were invited to provide any

4 The journal Polygraph instituted expert peer review in 2003. Articles published prior to that time were subject only to editorial review. Because Polygraph is an important academic and historic resource, studies published prior to 2003 and without peer review were included in this meta-analysis if they satisfied all of the other qualitative and quantitative requirements for selection.

5 One included study did not meet this requirement, and consisted only of cases confirmed by confession. Consistent with the known concern about inflated accuracy estimations resulting from over-reliance on confession confirmation for sample case selection, this study reported a near-perfect level of decision accuracy.

6 Present standards for research and publication by the APA stipulate that principal investigators should not also serve as study participants (i.e., examinees, examiners, or scorers). However, this was not a requirement in the past, and studies were not excluded from the meta-analysis based on this criterion.

7 One included technique is without published evidence of inter-scorer reliability or agreement.




published studies or citations to published studies for techniques that were not yet identified. One additional technique was suggested for inclusion at that time.8 In a follow-up mailing, a shorter list was sent to all school directors, or representatives, and other researchers involved in the development or validation of PDD techniques. This list included only those techniques for which published and replicated studies were identified, along with another request to provide any publications or citations for techniques that were not yet included in the survey. Two additional studies were suggested for inclusion at that time.9 Following the submission of initial results to the APA Board of Directors, and presentation of a preliminary version of this Executive Summary to the APA membership, one additional, recently published, study report was submitted in support of the IZCT.10

The committee contacted developers of PDD techniques for which there were insufficient published studies for inclusion in the meta-analysis, and volunteered assistance to anyone requesting it. Two studies were completed, have been accepted for publication, and are awaiting printing, for the Backster You-Phase technique. This technique was included in the meta-analysis. Additional studies were also completed for the Air Force Modified General Question Test (AFMGQT), the Federal You-Phase Technique, and the Directed Lie Screening Test (DLST), which have also been accepted for publication. The result of these additional efforts was that the complete array of techniques in common use today was included in the meta-analysis.

Thirty-eight studies satisfied the qualitative and quantitative requirements for inclusion in the meta-analysis. These studies involved 32 different samples, and described the results of 45 different experiments and surveys. These studies included 295 scorers who provided 11,737 scored results of 3,723 examinations, including 6,109 scores of 2,015 confirmed deceptive examinations and 5,628 scores of 1,708 confirmed truthful exams. Some of the cases were scored by multiple scorers and using multiple TDA methods.

Table 1 at the end of this executive summary shows the findings for those methods for which there is published and replicated evidence of criterion validity that satisfies the requirements of the APA Standards of Practice. Five PDD testing and analysis combinations meet APA 2012 requirements for evidentiary testing, five for paired-testing,11 and four for investigative examinations.12

Two PDD techniques produced accuracy rates that were outliers from and inconsistent with the distribution of results from all other techniques. They were the Integrated Zone Comparison Technique (IZCT) and the Matte Quadri-Track Zone Comparison

8 The Marcy Technique was suggested for inclusion. However, no published studies could be located regarding this technique.

9 The Gordon et al (2000) study of the Integrated Zone Comparison Technique (IZCT) could not be included due to a lack of adequate statistical information. The primary author informed the committee that the report was completed without him seeing the data, which belong to the intelligence service of a foreign government and are therefore unavailable. Data for the Mangan, Armitage and Adams (2008) study of the Matte Quadri-Track Zone Comparison Technique were provided to the committee, and this study was included.

10 A portion of the data for the Shurani (2011) study of the IZCT was provided to the committee and this study was included.

11 All PDD techniques that meet the criterion accuracy requirement for paired-testing also meet the standard requirement for investigative testing, and those techniques that meet the standard requirement for evidentiary work also meet the requirements for paired-testing and investigative testing.

12 All techniques that employed three-position TDA methods consistently exceeded the 2012 boundary requirements for inconclusive rates (20%). Because criterion accuracy rates for techniques with three-position TDA did not differ significantly from seven-position criterion accuracy, field practices that involve an initial analysis with the three-position TDA method may be considered acceptable if inconclusive results are resolved via subsequent analysis with a TDA method that provides both accuracy and inconclusive rates that meet the requirements of the APA 2012 standards.




Technique (MQTZCT). While it is within the realm of possibility that these two techniques are superior to other techniques, studies supporting them proved to have more unresolved methodological issues than others included in this meta-analysis. In addition to the committee's discovery of anomalous sampling distributions, both of these techniques are supported by studies authored by the developers and proprietors, and for which the developer/proprietor functioned as both principal investigator and study participant. From a scientific perspective, even well designed research generated by advocates of a method who have a vested interest in the outcome, and who act as participants and authors of the study report, does not have the compelling power of research not so encumbered by these factors. The techniques have been duly included here because they met the more general requirements outlined in the APA Standards of Practice. The committee advises that because of the potential impact on examiner effectiveness that could result from reliance on outlier results, it would be prudent for examiners to exercise an extra measure of caution before accepting data from studies showing extraordinary effects before they are subject to independent confirmation and extended analysis.

A dimensional profile of criterion accuracy was calculated for each PDD technique, including the unweighted average of the proportions of correct decisions for deceptive and truthful cases, excluding inconclusive results, along with the unweighted average of the proportions of inconclusive results.13 Results were aggregated for techniques that satisfy the APA 2012 requirements for evidentiary testing, paired testing, investigative testing, and for all PDD techniques included in the meta-analysis. Excluding outlier results, comparison question techniques intended for event-specific (single issue) diagnostic testing, in which the criterion variance of multiple relevant questions is assumed to be non-independent,14 produced an aggregated decision accuracy rate of .890 (.829 - .951), with a combined inconclusive rate of .110 (.047 - .173). Comparison question PDD techniques designed to be interpreted with the assumption of independence of the criterion variance of multiple relevant questions produced an aggregated decision accuracy rate of .850 (.773 - .926) with a combined inconclusive rate of .125 (.068 - .183). The combination of all validated PDD techniques, excluding outlier results, produced a decision accuracy level of .869 (.798 - .940) with an inconclusive rate of .128 (.068 - .187). Data at the present time are sufficient to support the polygraph as highly accurate, but insufficient to support an assertion that PDD testing can provide perfect or near-perfect accuracy.

Excluding outlier results, multi-variate analysis showed there were no significant one-way differences in decision accuracy for any of the PDD techniques that satisfy the requirements of the APA 2012 standards of practice. Neither were there any significant one-way differences in PDD techniques at the different levels of validation specified in the APA standards of practice, or for PDD techniques interpreted with decision rules based on an assumption of independence when compared with PDD techniques interpreted with decision rules based on an assumption of non-independence. This illustrates that the APA categorical distinctions are arbitrary, not empirically founded, and scientifically meaningless. These data are insufficient to support the notion that any PDD technique is superior to another, and instead suggest that differences in PDD question formats may be less important than

13 The unweighted average was considered to be a more conservative and realistic calculation of the overall accuracy of all PDD examination techniques. Calculation of the weighted average, or the simple proportion of correct decisions, often results in higher statistical findings that are less robust against differences in base-rates and therefore less generalizable.
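The distinction in footnote 13 between the unweighted and weighted averages can be made concrete with hypothetical counts (the numbers below are illustrative only, not data from the meta-analysis):

```python
# Hypothetical sample: 200 confirmed deceptive and 150 confirmed truthful
# cases, after excluding inconclusive results.
correct_deceptive, total_deceptive = 180, 200   # .90 correct among deceptive cases
correct_truthful, total_truthful = 120, 150     # .80 correct among truthful cases

# Unweighted average: each group's proportion counts equally,
# regardless of how many cases each group contributes.
unweighted = (correct_deceptive / total_deceptive
              + correct_truthful / total_truthful) / 2          # 0.85

# Weighted average: the simple pooled proportion of correct decisions,
# which drifts toward the accuracy of the larger group.
weighted = (correct_deceptive + correct_truthful) / (total_deceptive + total_truthful)
```

Because the deceptive group is larger in this example, the pooled proportion (about .857) is pulled toward that group's .90 accuracy, while the unweighted average (.85) is unaffected by the base rate, which is the robustness property footnote 13 describes.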

14 Independence, in scientific testing, refers to assumptions about whether external factors that affect the criterion state of each question (i.e., truthfulness about past behavior) are assumed to affect the criterion state of other questions. In PDD testing, the results of multi-facet and multi-issue exams are interpreted with decision rules based on the assumption of independence, while the results of event-specific single-issue examinations are more often interpreted with decision rules based on the assumption of non-independence.




previously assumed. Differences in PDD techniques may be limited to assumptions and procedural differences pertaining to TDA methodologies intended for the investigation of independent or non-independent investigation target questions. This should be subject to further study.

The present evidence supports an argument that PDD testing can provide both test sensitivity to deception and test specificity to truth-telling at rates that are significantly greater than chance when conducted and interpreted with the assumptions of criterion independence as well as non-independence among the test questions. Evidence shows that all PDD techniques included in the meta-analysis provide test sensitivity at rates that are significantly greater than chance. However, the present evidence is insufficient to support that every PDD technique is capable of providing test specificity to truth-telling at rates that are significantly greater than chance. Details can be found in the complete report.

If the present results were to be considered an overestimation of PDD accuracy, the major causes of that overestimation would be deficiencies in the sampling methodologies. One such factor is over-reliance on case confirmation via examinee confession, which may present the potential for the systematic exclusion of false-positive or false-negative errors for which no confession would be obtained. Another sampling concern would be publication or file-drawer bias, in which less favorable results are not submitted for publication and thus not available for inclusion in a meta-analysis or other systematic review. Another potential cause of accuracy overestimation would be the lack of independence between the technique developer, principal investigator and examiner participants in some included studies. If the present results were to be considered an under-estimation of PDD accuracy, the major cause might be argued to be deficiencies in the ecological validity of experimental and survey methodologies of the included studies. The present results are intended only to summarize the presently available publications that satisfy the requirements for inclusion in the meta-analysis. Limitations of the meta-analysis are discussed in the full report.

In closing, no attempt should be made to represent the results of this meta-analytic survey as an enforceable policy or standard. Although the dissemination of a list of validated polygraph techniques could be viewed by some as a form of de facto APA endorsement of those techniques, the actual role of this meta-analysis is as a thorough summary of the existing PDD literature. Although policies may tend to remain fixed for periods of time, scientific evidence is continuously evolving. The committee offers that questions and discussions of test validity are a matter of science and not mere policy. As such these questions are best answered by scientific evidence. This meta-analysis should be considered an information resource only, and no attempt should be made to represent this list of PDD techniques as the final authority on PDD test validation. Although completed with the goal of creating a comprehensive and inclusive list of validated techniques, it remains possible that other studies and techniques exist but have not been included in this meta-analysis. There exists in the published literature some evidence of validity for PDD techniques that were not able to be included in this meta-analytic survey. Of course, a meta-analysis based on different study selection or inclusion criteria may yield different results. Nothing in this executive summary or the complete report should be construed as preventing the use of any PDD technique for which the criterion accuracy level can be defended with scientific evidence. The information herein is provided to the APA Board to advise its professional membership of the strength of validation of PDD techniques available at this time. This information is intended only to ease the burden on PDD professionals and to help make evidence-based decisions regarding the selection of PDD techniques for use in field settings. It may also assist program administrators, policy makers, and courts to make evidence-based decisions about the informational value of PDD test results in general.




Table 1. Mean (standard deviation) and {95% confidence intervals} for correct decisions (CD) and inconclusive
results (INC) for validated PDD techniques. References can be found at the end of the complete report.
Evidentiary Techniques/ Paired Testing Techniques/ Investigative Techniques/
TDA Method TDA Method TDA Method
Federal You-Phase / ESS1 AFMGQT4,8 / ESS5 AFMGQT6,8 / 7 position
CD = .904 (.032) {.841 to .966} CD = .875 (.039) {.798 to .953} CD = .817 (.042) {.734 to .900}
INC = .192 (.033) {.127 to .256} INC = .170 (.036) {.100 to .241} INC = .197 (.030) {.138 to .255}
Event-Specific ZCT / ESS Backster You-Phase / Backster CIT7 / Lykken Scoring
CD = .921 (.028) {.866 to .977} CD = .862 (.037) {.787 to .932} CD = .823 (.041) {.744 to .903}
INC = .098 (.030) {.039 to .157} INC = .196 (.040) {.117 to .275} INC = NA
IZCT / Horizontal2 Federal You-Phase / 7 position DLST (TES)8 / 7 position
CD = .994 (.008) {.978 to .999} CD = .883 (.035) {.813 to .952} CD = .844 (.039) {.768 to .920}
INC = .033 (.019) {.001 to .069} INC = .168 (.037) {.096 to .241} INC = .088 (.028) {.034 to .142}
MQTZCT / Matte3 Federal ZCT / 7 position DLST (TES)8 / ESS
CD = .994 (.013) {.968 to .999} CD = .860 (.037) {.801 to .945} CD = .858 (.037) {.786 to .930}
INC = .029 (.015) {.001 to .058} INC = .171 (.040) {.113 to .269} INC = .090 (.026) {.039 to .142}
Utah ZCT DLT / Utah Federal ZCT / 7 pos. evidentiary
CD = .902 (.031) {.841 to .962} CD = .880 (.034) {.813 to .948} -
INC = .073 (.025) {.023 to .122} INC = .085 (.029) {.028 to .141}
Utah ZCT PLT / Utah
CD = .931 (.026) {.879 to .983} - -
INC = .077 (.028) {.022 to .133}
Utah ZCT Combined / Utah
CD = .930 (.026) {.875 to .984} - -
INC = .107 (.028) {.048 to .165}
Utah ZCT CPC-RCMP Series A / Utah
CD = .939 (.038) {.864 to .999} - -
INC = .185 (.041) {.104 to .266}

1 Empirical Scoring System.
2 Generalizability of this outlier result is limited by the fact that no measures of test reliability have been published for this technique. Also, significant differences were found in the sampling distributions of the included studies, suggesting that the sample data are not representative of each other, or that the exams were administered and/or scored differently. One of the studies involved a small sample (N = 12) that was reported in two articles, for which the participating scorer was also the technique developer. One of the publications described the study as a non-blind pilot study. Both reports indicated that one of the six truthful participants was removed from the study after making a false confession. The reported perfect accuracy rate did not include the false confession. Neither the perfect accuracy nor the .167 false-confession rate is likely to generalize to field settings.
3 Generalizability of this outlier result is limited by the fact that the developers and investigators have advised the necessity of intensive training available only from experienced practitioners of the technique, and have suggested that the complexity of the technique exceeds that which other professionals can learn from the published resources. The developer reported a near-perfect correlation coefficient of .99 for the numerical scores, suggesting an unprecedentedly high rate of inter-scorer agreement, which is unexpected given the purported complexity of the method. Additionally, the data initially provided to the committee for replication studies included only those cases for which the scorers arrived at the correct decision, excluding scores from those cases for which the scorers did not achieve the correct decision. Missing scores were later provided to the committee for both the Mangan et al (2008) and Shurani and Chavez (2009) studies. However, the resulting sampling means were different from those reported for both replication studies. Because of these discrepancies, the statistical analysis was not re-calculated with the missing scores, and the reported analysis reflects the sampling distribution means as reported. Sampling means for the replication studies should not be considered devoid of error or uncontrolled variance.
4 Two versions exist for the AFMGQT, with minor structural differences between them. There is no evidence to suggest that the performance of one version is superior to the other. Because replicated evidence would be required to reject a null-hypothesis that the differences are meaningless, and because the selected studies include a mixture of both AFMGQT versions, these results are provided as generalizable to both versions. AFMGQT exams are used in both multi-facet event-specific contexts and multi-issue screening contexts. Both multi-facet and multi-issue examinations were interpreted with decision rules based on an assumption of criterion independence among the RQs.
5
The AFMGQT produced accuracy that is satisfactory for paired testing only when scored with the Empirical Scoring System.
6
There are two techniques for which there are no published studies but which are structurally nearly identical to the AFMGQT: the LEPET and the
Utah MGQT. Validity of the AFMGQT can be generalized to these techniques if scored with the same TDA methods.
7
Concealed Information Test, also referred to as the Guilty Knowledge Test (GKT) and Peak of Tension test (POT). The data used here were
provided in the meta-analysis report of laboratory research by MacLaren (2001).
8
Studies for these PDD techniques were conducted using decision rules based on the assumption of criterion independence among the testing
targets. Accuracy of screening techniques may be further improved by the systematic use of a successive-hurdles approach.

Polygraph, 2011, 40(4) 202



Report of the Ad Hoc Committee on Validated Techniques

Abstract
Meta-analytic methods were used to calculate the effect size of validated psychophysiological
detection of deception (PDD) techniques, expressed in terms of criterion accuracy. Monte Carlo
methods were used to calculate statistical confidence intervals. Results were summarized for 45
different samples from experiments and surveys, including scored results from 295 scorers who
provided 11,737 scored results of 3,723 examinations, including 6,109 scores of 2,015 confirmed
deceptive examinations and 5,628 scores of 1,708 confirmed truthful exams. Fourteen different PDD
techniques were supported by a minimum of two published studies each that satisfied the
qualitative and quantitative requirements for inclusion in the meta-analysis. Results for the
individual studies, and for different PDD techniques, were compared using multivariate analytic
methods. Two studies produced outlier results that are not accounted for by the available
evidence and which are not generalizable. Excluding outliers, there were no significant differences
in criterion accuracy between any of the PDD techniques supported by the selected studies.
Excluding outlier results, comparison question techniques intended for event-specific (single issue)
diagnostic testing, in which the criterion variance of multiple relevant questions is assumed to be
non-independent, produced an aggregated decision accuracy rate of .890 (.829 - .951), with a
combined inconclusive rate of .110 (.047 - .173). Comparison question PDD techniques designed
to be interpreted with the assumption of independence of the criterion variance of multiple relevant
questions (multi-issue and multi-facet) produced an aggregated decision accuracy rate of .850 (.773 -
.926) with a combined inconclusive rate of .125 (.068 - .183). The combination of all validated
PDD techniques, excluding outlier results, produced a decision accuracy of .869 (.798 - .940) with
an inconclusive rate of .128 (.068 - .187).
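The confidence intervals in the abstract were obtained with Monte Carlo methods. As a loose illustration of the general approach only (not the committee's actual procedure; the accuracy value, sample size, and interval width below are arbitrary), repeated samples can be simulated around an observed proportion and an interval read from the quantiles of the simulated distribution:

```python
import random

def monte_carlo_ci(p_hat, n, trials=10000, alpha=0.10, seed=7):
    """Approximate a (1 - alpha) confidence interval for an observed
    proportion p_hat from n cases by simulating repeated samples."""
    rng = random.Random(seed)
    sims = []
    for _ in range(trials):
        hits = sum(1 for _ in range(n) if rng.random() < p_hat)
        sims.append(hits / n)
    sims.sort()
    lo = sims[int(trials * (alpha / 2))]
    hi = sims[int(trials * (1 - alpha / 2)) - 1]
    return lo, hi

lo, hi = monte_carlo_ci(p_hat=0.869, n=200)
print(f"90% interval: {lo:.3f} - {hi:.3f}")
```

The width of such an interval shrinks as the number of scored cases grows, which is why aggregated samples support tighter accuracy estimates than individual studies.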

Introduction

Is the polygraph scientifically valid? How accurate is the polygraph? The simplicity of these common questions implies that the accuracy of psychophysiological detection of deception (PDD) tests can be described with a simple answer or with a single numerical index. The present approach to answering these and other questions about criterion validity1 is a meta-analytic review of all available studies of criterion accuracy for all PDD examination techniques. Because the comparison question test (CQT) is the most commonly used and researched of all PDD techniques, this analysis will be primarily directed toward the CQT. There will be a limited discussion of research supporting the use of the Concealed Information Test (CIT).

The origins of all modern CQT formats can be traced to Reid (1947), who showed that some form of comparison question (CQ), intended to evoke a response from a truthful examinee, could improve test accuracy and reduce the occurrence of false-positive errors. CQT formats will often fall into one of two major families of techniques: techniques that emerged as modifications of the technique described by Reid (1947), and techniques that emerged as modifications of the technique described by Backster (1963). These techniques are conducted using differing, though often similar, procedures based on differing assumptions. These different assumptions and procedures can yield differences in test performance or test accuracy. Some techniques are highly theoretical about the exact nature and cause

1The terms “accuracy,” “test accuracy,” “criterion accuracy,” and “criterion validity” are used interchangeably and
synonymously throughout this document. The term “decision accuracy” is also used to describe criterion validity,
but in a more limited sense, referring only to the accuracy of decisions, excluding inconclusive results. In a more
complete sense, the term “criterion accuracy” refers to a dimensional set of concerns involving all aspects of test
accuracy.


of emotional or cognitive activity and resultant psychophysiological changes. Other techniques have emphasized an evidence-based scientific approach which forgoes unproven hypotheses and complex psychological assumptions about the exact thoughts and emotions of the examinee.

In general, the family of Zone Comparison Test2 (ZCT) formats that emerged from the work of Backster (1963) has been used most effectively for event-specific diagnostic testing. ZCT questions are formulated to describe the examinee's involvement in a single known or alleged behavioral issue of concern, and are interpreted with decision rules based on an assumption of non-independence3 of the criterion variance of the test questions. In contrast, the family of Modified General Question Test4 (MGQT) formats that emerged from the work of Reid (1947) is intended to describe and evaluate the examinee's involvement in different behavioral roles or different levels of involvement in a known or alleged incident. Although research has supported the CQT as capable of providing accuracy at levels that are significantly greater than chance, previous research (Barland, Honts & Barger, 1989; Podlesny & Truslow, 1993; Research Division Staff, 1995a, 1995b) has not supported the effectiveness of polygraph questions at pinpointing the exact behavioral role or level of involvement within an event-specific examination. In addition to their use in multi-facet investigations of known or alleged incidents, MGQT techniques are easily adapted to use in multi-issue screening contexts in which test questions are formulated to describe the examinee's possible involvement in several different behaviors for which there is no known incident or allegation. Both multi-facet and multi-issue MGQT examinations are commonly interpreted with decision rules based on an assumption that the criterion variance of the relevant question (RQ) stimuli is independent. As a matter of field practice, both families of techniques have at times been used under assumptions of both independence and non-independence among the RQs.

Other Reviews

Previous systematic reviews have been completed in an attempt to provide objective answers to the question of PDD accuracy and to reconcile this with claims of perfection. Abrams (1973) reviewed polygraph validity studies dating back to the early part of the 20th century, and reported an accuracy rate of .980. Later, Abrams (1977) reported the average accuracy of polygraph validity studies to be .910. Still later, Abrams (1989) summarized the accuracy of polygraph tests as .880.

Ansley (1983) reported the results of 1,964 laboratory cases and 1,113 field cases and described a decision accuracy level of .968, excluding inconclusive results. However, accuracy rates were not reported separately for criterion deceptive and truthful cases, and inconclusive rates were not reported. At that time the Relevant-Irrelevant technique was reported as more accurate (.960) than CQT methods (.952). Ansley (1983) reported the accuracy of CIT formats to be .912. Later, Ansley (1990) summarized the results of 10 studies of field examinations, involving 2,042 criminal cases since 1980, reporting an overall accuracy rate of .980 for deceptive cases and .970 for truthful cases, using the decisions of the original examiners. Ansley (1990) also described the results of 11 studies of blind evaluations of 922 criminal examinations, reporting accuracy levels of .900, with a reported accuracy rate of .940 for deceptive cases and .890 for truthful cases.
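The independence/non-independence distinction described above maps directly onto different decision rules. A simplified sketch (the cutscore values here are placeholders for illustration, not published cutscores): under non-independence the grand total across RQs drives the classification, while under independence each RQ subtotal is evaluated on its own.

```python
def classify_event_specific(subtotals, cutoff_di=-6, cutoff_ndi=6):
    """Non-independence: one criterion state, so classify on the grand total."""
    total = sum(subtotals)
    if total <= cutoff_di:
        return "DI"   # deception indicated
    if total >= cutoff_ndi:
        return "NDI"  # no deception indicated
    return "INC"      # inconclusive

def classify_multi_issue(subtotals, cutoff_di=-3, cutoff_ndi=1):
    """Independence: each RQ is evaluated separately; any significant
    subtotal yields an overall significant-reactions result."""
    if any(s <= cutoff_di for s in subtotals):
        return "SR"   # significant reactions
    if all(s >= cutoff_ndi for s in subtotals):
        return "NSR"  # no significant reactions
    return "INC"

scores = [-4, 2, 3]
print(classify_event_specific(scores), classify_multi_issue(scores))
```

With these example scores the grand-total rule is inconclusive, while the per-question rule flags the first RQ; that divergence is the practical consequence of assuming independent criterion variance among the RQs.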

2 Sometimes referred to by the historically correct expression "Zone Comparison Technique," as well as the "Zone of Comparison Technique."

3 Independence, in scientific testing, refers to assumptions about whether external factors that affect the criterion state of each question (i.e., truthfulness about past behavior) are assumed to affect the criterion state of other questions. In PDD testing, the results of multi-facet and multi-issue exams are interpreted with decision rules based on the assumption of independence, while the results of event-specific single-issue examinations are more often interpreted with decision rules based on the assumption of non-independence.

4 Also referred to as the Modified General Question Technique.


Honts and Peterson (1997) and Raskin and Honts (2002) reported the accuracy of the polygraph as exceeding .900. This is consistent with the .900 accuracy estimation of Raskin and Podlesny (1979). In contrast, the systematic review completed by the Office of Technology Assessment (OTA, 1983) suggested that laboratory studies had an average unweighted accuracy of .832 with an inconclusive rate of .269, while field studies had an average unweighted accuracy rate of .847 with an inconclusive rate of .042. Crewson (2001) surveyed studies of diagnostic and screening polygraphs5 in a comparison with medical and psychological tests, and reported diagnostic polygraphs to have an average accuracy rate of .880. Crewson also reported the average accuracy of screening polygraphs as .740. The Crewson information was also reported by Blackstone (2011), who argued that the confusion between diagnostic and screening polygraphs was a reason the polygraph did not enjoy greater support from the law.6

The most recent scientific review was completed by the National Research Council (NRC, 2003), which reported accuracy rates in terms of the area under the curve (AUC) using receiver operating characteristic (ROC) analysis, and concluded that laboratory studies had an average AUC of .860 while field studies had an average AUC of .890. More recently, Kokish, Levenson, and Blasingame (2005) reported the results of an opinion survey of sex offenders subject to polygraph monitoring as a condition of supervision and treatment. They reported that the offenders expressed a high rate of agreement with the results of the polygraph, over .900, but cautioned that offenders also claimed a false admission rate of approximately .050.

Although valuable in some ways, none of these previous surveys is capable of providing a satisfactory level of guidance regarding the American Polygraph Association (APA) 2012 standard for the use of validated techniques. These previous studies do not include information describing more recent advances in PDD criterion accuracy, and none of these surveys does an adequate job of providing a complete dimensional profile of the criterion validity of individual PDD examination techniques. More importantly, none of these previous surveys satisfies the need for summary information regarding study replication and the level of reliability and generalizability of study results for individual PDD techniques as used in field practice.

All previous reviews of PDD test accuracy are unsatisfying in their ability to answer the present question regarding the validity, criterion accuracy and reliability of PDD techniques in use today. First, previous reviews do not address test accuracy with an adequate description of the procedural combination of the test question sequence and test data analysis (TDA) method applied in the study, both of which are thought to have an important impact on test effectiveness. Second, and related to the first concern, is that none of the previous reviews made an effort to exclude PDD techniques that are no longer taught at accredited training programs or have fallen out of use in the field. As a result, a number of previous reviews are of

5 Diagnostic tests are any tests conducted in response to known problems, known symptoms, known incidents, or
known allegations. Screening tests are any tests conducted in the absence of a known problem, and are intended to
search for possible problems. In practice, diagnostic tests are commonly formulated around a single issue of
concern. Screening tests, because of the absence of any known problems, and because of interest in several types of
possible problems, are often constructed around multiple issues. The terms multi-issue and mixed-issue are used
interchangeably. It is not the number of issues that defines the distinction between diagnostic and screening tests,
but the presence or absence of a known problem.

6 Blackstone (2011) also confuses the distinction between diagnostic and screening polygraphs: first by using the less common terms "forensic" and "utility" instead of the more widely understood terms "diagnostic" and "screening," and then by attempting to portray single-issue screening exams as diagnostic exams. Blackstone further states that multi-facet exams are screening exams, and later that multi-facet exams are distinct from multi-issue exams in that multi-issue exams are conducted in the absence of a known issue, indicating that multi-facet exams are a type of diagnostic exam conducted in response to a known problem. In practice, the criterion variance of the RQs of both multi-issue and multi-facet polygraphs is assumed to be independent, and both types of exams are interpreted with decision rules that reflect this assumption.


little practical use to PDD field examiners, program administrators, policy makers and consumers regarding the merits of different PDD techniques.

Present Objectives

Of primary interest to this review are those PDD techniques for which there exists evidence in support of criterion validity at the levels required by the standards of practice of the APA, which, effective January 1, 2012, require the use of validated techniques. Those requirements state that event-specific diagnostic examinations conducted for evidentiary purposes, for which it is expected that the results may be used as evidence in a judiciary proceeding, should be conducted using techniques that produce a criterion accuracy level of .900 or higher, excluding inconclusives, and with an inconclusive rate of .200 or lower. Diagnostic examinations conducted using the paired-testing protocol can achieve a very high accuracy rate through the combination of results from examinations conducted with techniques that produce a mean criterion accuracy level of .860 or higher, excluding inconclusives, and with inconclusive rates of .200 or lower. Examinations conducted for investigative purposes should be conducted with techniques that produce a mean criterion accuracy level of .800 or higher, excluding inconclusives, and with inconclusive rates of .200 or lower.7 Validated techniques are further required to be supported by published and replicated scientific studies. To be generalizable, the studies should be based on samples that are representative of the general population.

In addition to specifying requirements for the criterion accuracy of PDD examination techniques, the APA has adopted a standard specifying that a validated technique consists of the combination of a test question sequence format which conforms to valid PDD testing principles, coupled with a valid method of test data analysis. The combination of these two core components is a recognition that a valid test must first obtain a suitable quantity of interpretable and meaningful (i.e., diagnostic) information, after which the information must be interpreted effectively. Neglecting either of these would result in unsatisfactory test performance. Moreover, because the results of a single un-replicated study are regarded as inconclusive in the realm of science, validated techniques are further required to be supported by at least two publications.

Selecting evidence-based techniques minimizes the exposure that would result from the use of un-standardized, un-validated, sub-optimal, or experimental methods. While the present effort was undertaken to provide useful information in this regard, readers are reminded that the report constitutes a literature review of publications available at the time the report was issued. New instrumentation, new validity research and new methods of analysis will become available in the future. In light of continuing advancements in the field, the findings in this report should be considered a reference, and not a policy of the APA.

The ethics of test administration were not addressed in this meta-analysis. Discussions of PDD test accuracy can sometimes digress into discussions of the ethics surrounding the procedures for test administration of probable-lie CQT formats, for which it is considered necessary to psychologically maneuver the examinee in order to achieve a satisfactory level of test specificity to truthfulness and to constrain false-positive errors to minimal levels. This discussion can also lead to unproductive, and indeed avoidant, deflections about the increased incremental validity (i.e., “test utility”) of

7 Near the completion of this report, the APA Board of Directors proposed a change to the Standards of Practice specific to screening techniques, because of the paucity of available research in this area despite the importance of this application to law enforcement and national security. The proposal would permit the use of screening techniques if research indicates an accuracy significantly greater than chance, and recommends the use of a successive-hurdles approach to minimize errors. Because the proposal had not been voted on prior to the completion of this report, no additional analyses of screening methods using the proposed standards are included here.
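The paired-testing provision in the standards discussed on this page rests on simple probability arithmetic: when two exams are administered independently to the two parties of a disputed issue and a decision is issued only when the results are consistent (one deceptive, one truthful), both exams must err for the combined decision to be wrong. A sketch under an assumed independence of errors (an illustration of the principle, not the APA's formal derivation):

```python
def paired_accuracy(a):
    """Accuracy of a paired-test decision, assuming each exam has decision
    accuracy `a`, errors are independent, and a decision is issued only
    when the two exam results are consistent with each other."""
    agree = a * a + (1 - a) * (1 - a)   # both correct, or both wrong
    return (a * a) / agree

print(round(paired_accuracy(0.86), 3))
```

Under these assumptions, two techniques at the .860 paired-testing threshold combine to roughly .974 decision accuracy, which is how paired testing can exceed the .900 evidentiary requirement.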


decisions made by professional consumers of PDD tests.8

Discussions of PDD test accuracy may also encompass the ethical complications of conducting a test for which the examiner's and the examinee's purposes may be dissimilar. For example, the examinee's desire is to generate data and test results to exonerate himself from further scrutiny, while the purpose of the examiner is often to psychologically leverage a confession of guilt or a disclosure of information from deceptive examinees. In its most limited and un-scientific use, the polygraph can become little more than a prop to enhance the effectiveness of an interview or interrogation, with little or no concern for the test result. It is important when reporting the criterion accuracy of PDD examinations to limit the focus to only those issues pertaining to the level of accuracy of polygraph decisions rather than merely confession rates. In other words, how effective are modern polygraph techniques at correctly classifying examinees as deceptive or truthful?

Test accuracy, in a scientific sense, means several things and can convey information about many important concerns. Foundational among these concerns is the issue of construct validity, which, in its most simplistic representation, refers to the correctness of the underlying constructs, principles, or ideas on which a test is constructed. Simply stated, construct validity refers to whether the PDD test does what it is intended to do. As a practical matter, PDD tests are often referred to as lie-detection tests; therefore, a broad formulation of the question of construct validity would involve whether a PDD test actually tests for or measures lies. Lies are amorphous and therefore cannot be measured per se. Deception is a temporal act, and PDD exams, like many other scientific tests, are scored numerically by measuring or observing the examinee's responses to the test stimuli.9 Deception or truthfulness is inferred statistically, by observing or measuring the responses to several iterations of the test stimuli, aggregating the responses, and then using structured decision rules to interpret the result through comparison with normative data.

Construct validity can also refer to the correctness of assumptions about the function of the structural components of the PDD test: whether the questions function as intended. Several studies have investigated the construct validity of various types of test questions. For example, overall-truth questions have been shown not to function as intended, with Hilliard (1979) and Abrams (1984) both providing insight into the general complications and concerns pertaining to questions of intent. Likewise, symptomatic questions, intended to test for or correct for outside issues, have been shown not to function as intended or reputed (Honts, Amato & Gordon, 2004; Krapohl & Ryan, 2001), despite some weak evidence of support in an early study (Capps, Knill & Evans, 1993). Technical questions designed to test for, or make use of, esoteric phenomena such as a guilt-complex have been shown not to function effectively (Podlesny, Raskin & Barland, 1976). Similarly, sacrifice questions, regarding the examinee's intent to answer truthfully regarding the RQs, have been shown not to function as intended (Capps, 1991; Horvath, 1994).
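The statistical inference described above, aggregating responses over repeated presentations of the test stimuli and comparing relevant against comparison questions, can be sketched as follows; the response magnitudes and the zero cutoff are invented for illustration and stand in for the structured decision rules and normative data used in practice:

```python
# Hypothetical standardized response magnitudes over three presentations
# of the question sequence: each pair is (relevant stimulus, comparison stimulus).
presentations = [(1.8, 0.9), (1.5, 1.1), (2.1, 0.7)]

# Comparison-minus-relevant differences: negative values reflect larger
# responses to relevant stimuli, the pattern expected from deception.
aggregate = sum(c - r for r, c in presentations)
result = "deception indicated" if aggregate < 0 else "no deception indicated"
print(f"aggregate score {aggregate:+.1f}: {result}")
```

Aggregating over several iterations, rather than judging any single response, is what makes the inference statistical rather than a direct measurement of a lie.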

8 The term incremental validity is preferred to the older term utility because it implies an expectation for empirical evidence of increased decision accuracy or decision effectiveness on the part of consumers of PDD test results, as a result of information gained from the polygraph, and not a mere assumption that all information will prove helpful or useful.

9 Deception during PDD exams is inferred empirically in that PDD studies have shown that deception and truth-
telling can be determined with the CQT at rates that are significantly greater than chance as a function of the
differential magnitude of response to relevant and comparison stimuli. Differences in response magnitude are
thought to be a function of the salience of the stimuli. Persons who are being deceptive regarding the relevant
stimuli are expected to show responses of generally larger magnitude to relevant than comparison stimuli, while
persons who are truthful regarding the relevant stimuli are expected to show generally larger responses to
comparison stimuli.


Although hypotheses are abundant, scientific studies have been unable to show evidence of construct validity for the array of technical questions, with one exception: the CQ is generally capable of producing larger reactions from truthful persons than the RQs. While construct validity is an important concern, this meta-analysis addressed criterion validity, the ability to differentiate deception from truth-telling at a practical level, with an emphasis on the identification of PDD techniques for which there exists published and replicated evidence in support of test accuracy. It did not address questions pertaining to construct validity.

The criterion accuracy of CQT methods used in PDD examinations cannot be adequately described by a single numerical value. Instead, criterion validity for CQT PDD examinations is the result of an interaction of several dimensions of concern, including correct and false hits for deceptive decisions, and correct and false hits for truthful decisions. In field practice, test results are interpreted or expressed in terms of the presence or absence of significant reactions indicative of deception. Categorical test results for individual cases are either positive or negative,10 and the criterion validity of a test is estimated by partitioning the results of sample cases into true positives, false positives, true negatives and false negatives. Attempts to describe criterion accuracy are further complicated by inconclusive results, for which sample results are partitioned into inconclusive results for deceptive and truthful groups. In addition to the challenges of measuring and describing estimates of PDD test accuracy, different polygraph techniques achieve different dimensional profiles of criterion accuracy. Some techniques may be intended to provide high test sensitivity at the expense of other dimensions of test accuracy, while other techniques may be designed to seek a balance of test sensitivity and test specificity.

This meta-analytic survey is intended to summarize the present state of existing published scientific evidence of the criterion validity of PDD examination techniques, and to provide guidance regarding those techniques which can be expected to reliably provide criterion accuracy that satisfies the APA's requirements for precision at the evidentiary, paired-testing, and investigative levels. Therefore, this study is limited to those techniques for which the available evidence supports their criterion validity, and does not include PDD techniques with un-replicated research, or techniques for which there is replicated evidence at levels that do not satisfy the requirements of the APA standards.11

Method

A literature survey was conducted to identify published studies that provided usable information regarding the criterion accuracy of identified PDD techniques. The results of un-replicated studies are not useful in meta-analytic research, and were therefore not included. As a practical decision, this meant that at least two published studies were required for the inclusion of an examination technique in the meta-analysis. Examination techniques were retained in the meta-analysis if published and replicated studies were identified in support of the validity of a technique, and if the aggregated results of the included studies indicated a reliable and generalizable level of accuracy consistent with the requirements of the APA Standards of Practice for evidentiary testing, paired testing, or investigative testing. However, it was not a requirement that individual studies produce criterion accuracy at the levels specified by the APA.
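The dimensional profile of criterion accuracy described above, correct and false hits for each criterion group plus inconclusives, can be made concrete with a small summary function; the case counts below are invented for illustration. Decision accuracy conventionally excludes inconclusive results, while the sensitivity and specificity figures here count inconclusives against each group:

```python
def accuracy_profile(tp, fp, tn, fn, inc_d, inc_t):
    """Summarize criterion accuracy from partitioned sample counts.
    tp/fn/inc_d come from confirmed-deceptive cases; tn/fp/inc_t from
    confirmed-truthful cases."""
    decisions = tp + fp + tn + fn
    total = decisions + inc_d + inc_t
    return {
        "decision_accuracy": (tp + tn) / decisions,  # excludes inconclusives
        "inconclusive_rate": (inc_d + inc_t) / total,
        "sensitivity": tp / (tp + fn + inc_d),
        "specificity": tn / (tn + fp + inc_t),
    }

profile = accuracy_profile(tp=85, fp=6, tn=80, fn=9, inc_d=6, inc_t=14)
print({k: round(v, 3) for k, v in profile.items()})
```

Note how a single "accuracy" number hides the asymmetry: with these invented counts the technique is more sensitive to deception than specific to truthfulness, the kind of dimensional trade-off the text describes.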

10Although PDD test results are interpreted, in field practice, for the presence or absence of significant indicators of
deception, the results of scientific tests are often discussed in value-neutral language. Positive test results are
designated to signify the presence of the issue or concern that is being tested. Negative test results signify the
absence of the concern or issue.

11 Because research is ongoing in all fields of science, and because standards of practice undergo periodic review and necessary modification, readers are reminded that other PDD techniques may satisfy the present requirements of the APA standards. Field examiners, program administrators and quality assurance reviewers are advised to evaluate this information with awareness of new and emerging standards and information.


It was deemed important to be as inclusive as possible with stakeholders (i.e., school directors and developers of PDD techniques) in the selection of studies and techniques for the meta-analysis, as it was anticipated that the work of many PDD field examiners and trainers may be affected by the results and recommendations of this meta-analysis. To accomplish this, a long-list of all identifiable PDD techniques was assembled and disseminated to all APA accredited polygraph schools in late March 2011 using the contact information listed at the APA website (www.polygraph.org). School directors or their representatives were invited to advise the committee of any techniques which had not yet been identified, and to provide either citations or copies of studies that could be accessed and reviewed in support of the suggested techniques.12

In early June 2011, a short-list was disseminated to all APA accredited polygraph schools, again using the contact information at the APA website, including both the techniques and the citations for which published and replicated studies were identified. Also sent was a list describing those techniques for which the basis of publication was inadequate for inclusion in the meta-analysis. School directors or their representatives were again invited to respond and advise the committee of any techniques or any published studies that should be considered for inclusion in the meta-analysis.13

Study Selection

Requirements for the selection of individual studies into the meta-analysis were both qualitative and quantitative. Some studies were not designed to function as studies of criterion validity, but were intended to investigate specific research questions, such as the effects of countermeasures on PDD accuracy. Studies designed to examine causality and construct questions may not be useful for answering questions about criterion accuracy; however, studies were included if they provided sufficient information to calculate the criterion accuracy of a survey sample or normal control group that was not subject to experimental manipulation beyond truthfulness or deception.

The APA president and the committee chairperson expressed to the committee members that additional studies should be considered for inclusion in the meta-analysis if there was sufficient time to complete the review and publication process prior to the completion of the meta-analysis. Several studies were subsequently completed and submitted for peer review and publication. Results from those studies were included once the studies were accepted for publication. All of these “in press” studies were designed as criterion accuracy studies, consistent with the requirements of the meta-analysis.

Qualitative selection requirements. Qualitative requirements for selection and inclusion in the research review were that studies selected must be published in the journal Polygraph or other peer-reviewed scientific publication.14 Studies were also considered for selection if they were published by an academic degree-granting institution that was accredited by an accrediting agency

12One technique, developed by Lynn Marcy, was requested to be added to the list for review. However, no published
studies could be located regarding that technique. At least one school representative recommended additional
research for several PDD techniques.

13One study was suggested for inclusion in the meta-analysis at that time: the Gordon et al. (2000) field study of the IZCT, of which the 2010-2011 APA President was the developer and in which he has a proprietary interest. However, the Gordon et al. (2000) report included no reliability statistics and no statistical parameters or description of the sampling distributions of deceptive and truthful scores. The committee was advised by the primary author (personal communication, June 10, 2011) that he had never seen the data or the cases because they belong to the intelligence service of a foreign government. It was determined that this study could not be included due to the lack of published reliability data and the inability to evaluate the study data against that from other studies. Exclusion of this study did not prevent the IZCT from being included in the meta-analysis.

14The journal Polygraph instituted expert peer review in 2003. Articles published prior to that time were subject only
to editorial review. Because Polygraph is an important academic and historic resource, studies published prior to
2003 and without peer review were included in this meta-analysis if they satisfied all of the other qualitative and
quantitative requirements for selection.

209 Polygraph, 2011, 40(4)


Validated Techniques

that is recognized by the United States (US) Department of Education or its foreign equivalent. Also considered for selection were research publications of studies funded by government agencies that underwent external peer review. Edited academic texts and their chapters were also included. Studies not subject to editorial or external review, and studies described only in self-published books, were not considered.

Selection into the meta-analysis required that studies were conducted in a manner that allows for the confident generalization of the study results to field settings. These requirements included that studies be conducted using instrument recording and component sensors that reflect field practices (i.e., using two pneumograph sensors, electrodermal sensors, and cardiograph arm-cuff sensors), and a PDD technique for which there exists a published description of the examination technique, including rules and procedures for target selection, question formulation, and test presentation of the test stimuli. In addition, it was required that the method of test data analysis be consistent with field practice and supported by a published description of its features, transformations (scoring rules), decision rules, and normative data15 or cutscores. Results from studies that solely utilized automated algorithmic TDA models were not included in the meta-analysis.

Ground truth criteria must have been independent of the polygraph decision.16 Because the decision of the original examiner could have been influenced by extra-polygraphic information, blind scores were preferred over original examiner decisions.17 Basic demographic information was required regarding the examinees and the experience and training of the examiners.

Principal investigators for studies selected for inclusion were required to be blind to the criterion status of the cases and study participants.18 Although the present APA research standards require that the principal investigator not participate in the data collection, this requirement did not exist prior to March 2011, and therefore was not enforced in the selection of studies for the meta-analysis.

Laboratory and field studies. Field studies are important to polygraph research as these studies have the advantage of known ecological validity and are therefore assumed to have increased generalizability in this regard. However, the representativeness and generalizability of field studies are compromised, to some unknown degree, by the inherently non-random case selection process, which depends on the availability of confirmation data. While field studies are highly useful for studying correlations, their usefulness is frustrated by the impossibility of controlling enough variables to determine causality and construct validity.

Laboratory studies are also important to polygraph research as these studies can

15Most PDD TDA methods do not use normative data, but use traditional cutscores that were determined
hypothetically or arbitrarily. Although traditional cutscores have been verified as effective, the lack of normative
data means that the level of statistical significance for traditional cutscores remains unknown and these cutscores
might be suboptimal.

16 Confirmation based on confession alone would exclude inconclusive and error cases, and would tend to inflate accuracy calculations. Judicial outcomes as a criterion are also not independent if polygraph evidence was considered during the judicial proceedings, and could lead to inflated accuracy estimates. One included study (Mangan, Armitage & Adams, 2008) did not meet this requirement, and was based only on sample cases that were confirmed by confession. Not surprisingly, the study resulted in a reported 100% accuracy rate. Verschuere, Meijer, & Merckelbach (2008) argued that the results of this study were a methodological artifact and therefore unreliable.

17As a practical matter, the inclusion of extra-polygraphic information may be advantageous if it increases the accuracy of field examiners. In the present analysis, answers to questions of criterion validity pertain only to whether or not the PDD exam data contain information that can be scored and interpreted to an accurate conclusion.

18One study (Gordon et al., 2005) was reported as non-blind in another report by Mohamed et al. (2006), who described the same comparison of fMRI results with those of the PDD examination.



Ad Hoc Committee on Validated Techniques

more easily be constructed using random methods that reduce research and sampling bias, and thereby increase the generalizability of resulting information. Laboratory studies also provide a greater ability to control a broader range of variables, and are important to the study of causality and construct validity. However, the generalizability of laboratory studies is thought to be reduced by the fact that these studies are sometimes conducted in circumstances that imperfectly approximate the ecological conditions of field examinations.

For the purpose of reviewing the current state of validation regarding existing polygraph techniques, the ad hoc committee chose to regard field and laboratory studies with equal consideration if they satisfied the qualitative and quantitative requirements for selection into the meta-analysis. Differences between criterion accuracy of field and laboratory studies have historically been statistically insignificant, and it would be unwise to attempt to opine or hypothesize about the meaning of any differences observed in a single comparison. Research studies and reviews have shown a high level of agreement between field and laboratory studies, and the ultimate cause of any differences should be determined through the study of data. Anderson, Lindsay, and Bushman (1999) recently examined a broad range of laboratory-based psychological research on aggressive behavior and concluded the following: “correspondence between lab- and field-based effect sizes of conceptually similar independent and dependent variables was considerable. In brief, the psychological laboratory has generally produced truths, rather than trivialities” (p. 3). In the area of research directly related to the polygraph, the NRC (2003) found no significant differences between the results of laboratory and field research. Similarly, Pollina et al. (2004) found no significant differences in classification accuracy of field and laboratory polygraph research.

Quantitative selection requirements. Quantitative requirements for inclusion in the meta-analysis were that studies provide sufficient information to calculate reliability and criterion validity of the technique employed. In order to calculate a complete dimensional profile of criterion accuracy, reported or available data must minimally include the sample size for truthful and deceptive groups, along with correct decisions, inconclusive rates and error rates for the truthful and deceptive cases. Several of the principal investigators were contacted for additional information regarding the sampling distributions. Quantitatively, studies were excluded only due to the lack of adequate information available to calculate the reliability and generalizability of the study results, and the dimensional profile of criterion accuracy.

Reliability. Several different statistical metrics have been used to describe the reliability of PDD examination techniques, including the Pearson product moment correlation of the numerical scores of study participants, Cohen's Kappa statistic describing the chance-corrected level of agreement between two study participants, Fleiss' Kappa statistic describing the level of agreement between three or more participants, and the uncorrected proportion of decision agreement between study participants. None of these reliability metrics was favored over the others, and studies were not excluded for missing reliability data so long as they included any statistical description of the interrater reliability of numerical scores or decisions.19

Sampling distributions. Mean and standard deviation parameters were required, or at least minimally able to be calculated from available data, for the deceptive and truthful distributions of scores for all selected studies. It was expected that multiple samples drawn from the same underlying population and scored with the same TDA method would produce sampling distributions that do not differ significantly. It was also expected that replication and aggregation of the results of sampling distributions would produce results that are more representative and generalizable

19One included technique did not meet this requirement. None of the published studies on the IZCT have included
any statistical evidence of inter-rater reliability.
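The two-rater agreement metrics named above, the uncorrected proportion of decision agreement and Cohen's kappa, can be illustrated with a minimal sketch. The decision labels (DI/NDI) in the usage example are hypothetical; this shows the standard formulas, not the committee's actual computations.

```python
from collections import Counter

def proportion_agreement(rater_a, rater_b):
    # Uncorrected proportion of decision agreement between two scorers.
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    # Chance-corrected level of agreement (Cohen's kappa) between two scorers.
    n = len(rater_a)
    observed = proportion_agreement(rater_a, rater_b)
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Expected chance agreement from each rater's marginal decision rates.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    return (observed - expected) / (1 - expected)
```

For example, two scorers who agree on three of four hypothetical decisions (`["DI", "DI", "NDI", "NDI"]` versus `["DI", "NDI", "NDI", "NDI"]`) show .75 observed agreement but a kappa of only .50 after the chance correction.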


than any single sampling distribution, hence one of the reasons for requiring at least two studies of a technique.

Some studies were published with incomplete descriptions of the sampling distributions. Therefore, it was necessary to obtain raw scores from some of the principal investigators in order to calculate these missing statistics.20 Most data were still available, and investigators were willing to provide additional information as requested.21 It was expected that the sample distributions for each PDD technique would not differ at statistically significant levels if the samples were obtained from examinees who were representative of the same underlying population, the PDD technique was administered in a similar manner for each study, and the data were scored and interpreted with a similar application of the rules and procedures for test data analysis.

The absence of significant differences in sampling distributions would be interpreted as indicative that the sample distributions are representative of each other. This would increase our confidence regarding the reliability with which the samples are representative of, and generalizable to, other testing populations. Significant differences would be interpreted as indicative of samples drawn from different populations, or of differences in PDD test administration or the application of the rules and procedures for test data analysis. Confidence in the reliability and reproducibility would be reduced under these circumstances.

Criterion accuracy

Study information for criterion validity was regarded as sufficient for inclusion in the meta-analysis if a study provided enough information to calculate the complete dimensional profile of criterion accuracy, including: test sensitivity and specificity (i.e., test accuracy for deceptive and truthful groups excluding inconclusive results), inconclusive rates for the deceptive and truthful groups, positive predictive value (PPV) (i.e., the proportion of true positives to all positive results) and negative predictive value (NPV) (i.e., the proportion of true negatives to all negative results), and the proportion of correct decisions excluding inconclusive results among deceptive and truthful cases, labeled unweighted accuracy. It was not necessary for a study to provide all of these dimensional descriptions, and studies were included if it was possible for the committee to calculate these statistics from the available information. The complete dimensional profile could be calculated from a minimum of five values: decision accuracy for truthful and deceptive groups, with or without inconclusive results, along with the inconclusive rates for the deceptive and truthful groups and the number of study participants assigned to each group.

The reduction of these important criterion dimensions to a single number cannot be accomplished without neglecting a substantial portion of important information. Another complication was that some measures of accuracy are non-resistant to differences in base-rates or prior probabilities. For example, PPV and NPV are vulnerable to differences in base-rates and inconclusive rates, so these criterion dimensions are less useful for comparison of the accuracy of different techniques for which the studies may be conducted using samples with differing base-rates. The unweighted average of correct decisions and the unweighted average of the inconclusive rates for deceptive and truthful cases were determined to provide the most usable and generalizable criterion information when comparing studies with potentially different base-rates.
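The dimensional profile of criterion accuracy described above can be sketched from per-group counts. The function name and the sample numbers in the usage example are hypothetical; the handling of inconclusive results (sensitivity, specificity and unweighted accuracy exclude inconclusives, while PPV and NPV are computed over all positive or negative calls) is one common convention, not necessarily the committee's exact procedure.

```python
def accuracy_profile(n_d, correct_d, inc_d, n_t, correct_t, inc_t):
    """Dimensional profile of criterion accuracy from per-group counts:
    group sizes, correct decisions, and inconclusive counts."""
    errors_d = n_d - correct_d - inc_d        # false negatives
    errors_t = n_t - correct_t - inc_t        # false positives
    sensitivity = correct_d / (n_d - inc_d)   # correct deceptive calls, excl. inconclusives
    specificity = correct_t / (n_t - inc_t)   # correct truthful calls, excl. inconclusives
    ppv = correct_d / (correct_d + errors_t)  # true positives / all positive calls
    npv = correct_t / (correct_t + errors_d)  # true negatives / all negative calls
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "inconclusive_d": inc_d / n_d,
        "inconclusive_t": inc_t / n_t,
        "ppv": ppv,
        "npv": npv,
        "unweighted_accuracy": (sensitivity + specificity) / 2,
        "unweighted_inconclusive": (inc_d / n_d + inc_t / n_t) / 2,
    }
```

With hypothetical counts of 50 deceptive cases (40 correct, 5 inconclusive) and 50 truthful cases (38 correct, 8 inconclusive), the sketch yields a sensitivity of .889, specificity of .905, and an unweighted inconclusive rate of .13.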

20Statistical descriptions of sampling distributions are now commonly required for publication in scientific journals,
including the journal Polygraph. Editors and reviewers of scientific publications did not always require this
information in the past because the importance of future meta-analytic research was not always anticipated.

21Statistical parameters or raw scores could not be obtained for three studies, regarding two techniques, which were
conducted by the US Department of Defense. Although committee members were aware that the studies had been
subjected to thorough and adequate review by scientists at the Department of Defense, the absence of the data was
inconvenient. Reliability statistics were provided in the Department of Defense study reports, and it was decided to
retain these studies in the meta-analysis. These studies were later replicated independently.
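The base-rate vulnerability of PPV noted above follows directly from Bayes' theorem; a short sketch with hypothetical sensitivity and specificity values makes the effect concrete.

```python
def ppv(base_rate, sensitivity, specificity):
    # Positive predictive value by Bayes' theorem: the share of positive
    # (deceptive) calls that are true positives at a given base rate.
    true_pos = base_rate * sensitivity
    false_pos = (1 - base_rate) * (1 - specificity)
    return true_pos / (true_pos + false_pos)
```

With a hypothetical .90 sensitivity and .90 specificity, PPV is .90 at a 50% base rate of deception but falls to .50 at a 10% base rate, which is why PPV and NPV were not used to compare techniques across studies with differing base-rates.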


Moderator variables

Results of included studies and PDD techniques were coded and grouped for conformance with the APA 2012 standards for evidentiary testing, paired-testing, and investigative testing. PDD techniques were also coded for whether they are intended to be interpreted with the assumption of independent or non-independent criterion variance among the RQ stimuli. No other moderators or mediators were coded for this study.

Data analysis

All samples were regarded as biased and imperfect representations of the populations from which they are drawn. This meant that some differences would be expected to be observed between the sample distribution parameters and the population distribution parameters if it were possible to obtain data from the entire population. However, representative samples would be expected to deviate from the population in ways that are not statistically significant. Samples from different studies, if all are representative of the population, would also be expected to differ in ways that are not statistically significant.

Multivariate ANOVAs were used to compare the sampling distribution parameters of each study to the sampling distributions of other replication studies for each PDD examination technique. Monte Carlo methods were used to calculate standard errors and statistical confidence intervals for the criterion accuracy profiles of each of the studies included in the meta-analysis. Data were aggregated for those techniques that are intended to be interpreted with the assumption of independence among the RQ stimuli, and those that are interpreted with the assumption of non-independence.

Results of the meta-analysis were not graded or weighted for study quality, other than the quantitative selection criteria that were previously described for inclusion in the meta-analysis. In other words, all studies were considered equally if they satisfied the qualitative requirements for inclusion in the meta-analysis. Study results were weighted by sample size and number of participating scorers, in that studies based on larger samples and studies that involved a greater number of scorers were given proportionally more weight in the meta-analysis. Because some samples were collected with multiple scorers who scored only a subset of the entire sample, weighting values are equivalent to the number of scored results for each study and each PDD technique.

Research questions addressed by this meta-analysis are: 1) which PDD examination techniques have published and replicated evidence of validity that satisfies the APA 2012 standards of practice requirement for decision accuracy and inconclusive rates, 2) what is the overall accuracy of validated PDD techniques interpreted with the assumption of independence among the RQ stimuli, 3) what is the accuracy level of PDD techniques interpreted with the assumption of non-independence among the RQ stimuli, 4) are there significant differences or outliers among any of the validated PDD techniques, and 5) are there any outlier results that are not accounted for by the presently available evidence.

Results

Alpha was set at .05 for all statistical comparisons.

Validated PDD Techniques

Thirty-eight studies satisfied the criteria for inclusion in the meta-analysis. These studies were based on 32 different samples of confirmed cases, from which 45 different samples of scores were obtained. These studies involved 295 scorers who provided 11,737 scored results of 3,723 examinations, including 6,109 scores of 2,015 confirmed deceptive examinations and 5,628 scores of 1,708 confirmed truthful exams. Some of the samples were used in different studies, some were scored using different TDA models, and some samples were scored by more than one scorer. Appendix A shows a list of the included studies and sample sizes. Criterion accuracy reported for each of the included studies can be seen in Appendix B. Reliability statistics for the included studies are shown in Appendix C. Appendix D shows the mean and standard deviations for the sampling distributions of deceptive and truthful scores for the included studies. Fourteen PDD techniques were identified as being supported by published and replicated


studies that met the qualitative and quantitative selection requirements for this meta-analysis. Presented alphabetically, they were:

AFMGQT / Seven-position TDA

The United States Air Force Modified General Question Technique (AFMGQT)22 is a modern variant of the family of CQT formats that have emerged as modifications of the original General Question Technique (Reid, 1947) and Zone Comparison Technique (Backster, 1963). The AFMGQT can be used effectively with two, three or four RQs. AFMGQT examinations are used in both multi-facet event-specific diagnostic contexts and multi-issue screening contexts (e.g., public safety employee selection, government security screening, post-conviction screening programs, etc.). AFMGQT exams conducted in both multi-facet and multi-issue contexts are interpreted with decision rules based on an assumption that the criterion variance of the RQs is independent. Three studies describe the criterion accuracy of the AFMGQT when scored with the seven-position TDA model.

Senter, Waller and Krapohl (2008), using blind seven-position numerical scores of 33 programmed deceptive and 36 programmed truthful examinees who were tested using the AFMGQT following their participation in a mock roadside bombing scenario, reported an unweighted decision accuracy of .849, with an inconclusive rate of .015.

Nelson and Handler (In press) used Monte Carlo methods to study the criterion accuracy of seven-position numerical scores of AFMGQT exams with two, three and four RQs. The Monte Carlo space consisted of 50 criterion truthful cases and 50 criterion deceptive cases. Unweighted decision accuracy was reported as .814, along with an unweighted inconclusive rate of .280.

Nelson, Handler, Morgan and O'Burke (In press) obtained blind numerical scores from three experienced examiners employed by the Iraqi government who used the seven-position TDA model to evaluate a confirmed case sample of AFMGQT (N = 22) exams that were selected from the U.S. Department of Defense confirmed case archive. Eleven of the cases were confirmed as deceptive; the remaining 11 were confirmed as truthful. A total of 66 examination scores were obtained, and unweighted decision accuracy was reported as .761, along with an unweighted inconclusive rate of .242.

Figure 1 shows a mean and standard deviation plot of the subtotal scores23 of the sampling distributions of the three AFMGQT seven-position studies. No mean and standard deviation data were available for the AFMGQT study completed by Senter, Waller and Krapohl (2008). A two-way ANOVA showed that the interaction of sampling distribution and criterion status was not significant [F (1,68) = 0.263, (p = .610)], nor was the main effect for sampling distribution [F (1,68) = 0.013, (p = .910)].

The combined decision accuracy level of these AFMGQT seven-position TDA studies, weighted for sample size and number of scorers, was .822 with a combined inconclusive rate of .191. Reliability for seven-position scores of AFMGQT exams was reported by Senter, Waller and Krapohl (2008) as a kappa statistic of .750. The proportion of overall decision agreement, excluding inconclusive results, for all studies was .965.
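The weighting scheme described earlier, with weights equivalent to the number of scored results (sample size times participating scorers), can be sketched as follows. The study tuples in the example are hypothetical and are not the AFMGQT data, so they do not reproduce the reported .822 figure.

```python
def combine_weighted(studies):
    # studies: (accuracy, n_exams, n_scorers) per study; the weight for each
    # study is its number of scored results, n_exams * n_scorers.
    weights = [n * k for _, n, k in studies]
    total = sum(weights)
    return sum(acc * w for (acc, _, _), w in zip(studies, weights)) / total
```

For two hypothetical studies, one with accuracy .80 on 50 single-scored exams and one with accuracy .90 on 50 exams scored by three scorers, the combined estimate is (.80 x 50 + .90 x 150) / 200 = .875, reflecting the heavier weight of the triple-scored sample.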

22Two versions exist for the AFMGQT: version 1 and version 2. Differences between these two versions are based on unstudied assumptions regarding test structure, and the effect of these differences has not been thoroughly studied. There is no compelling hypothesis suggesting the performance of one version would be different from or superior to another. Evidence available at the present time suggests that both versions perform adequately, and no significant differences have been identified. Therefore, these results are suggested as generalizable to versions 1 and 2 of the AFMGQT, both of which are represented in the included studies.

23Subtotal scores were used in this analysis because the AFMGQT is scored with decision rules using only subtotal
scores which assume criterion independence among the RQs.


Figure 1. Mean deceptive and truthful subtotal scores for AFMGQT seven-position studies.

AFMGQT / ESS

Three studies describe the criterion accuracy of AFMGQT exams when scored using the ESS.

Nelson and Blalock (In press) transformed seven-position AFMGQT scores from the Senter, Waller and Krapohl (2008) laboratory study to ESS scores, including 33 results for confirmed deceptive cases and 36 results for confirmed truthful cases. Unweighted decision accuracy was .839, with an inconclusive rate of .152.

Nelson, Blalock and Handler (2011) obtained blind ESS scores from two inexperienced examiners and one experienced examiner who used the ESS to evaluate a sample of confirmed AFMGQT exams (N = 22), including 11 exams that were confirmed as deceptive and 11 exams that were confirmed as truthful. A total of 66 examination scores were obtained, and unweighted decision accuracy was .883, with an inconclusive rate of .183.

Nelson, Handler and Senter (In press) used Monte Carlo methods to study the criterion accuracy of ESS scores of AFMGQT exams with two, three and four RQs. The Monte Carlo space consisted of 50 criterion truthful cases and 50 criterion deceptive cases. Unweighted decision accuracy was reported as .876, with an unweighted inconclusive rate of .178.

Figure 2 shows a mean and standard deviation plot of the scores of the sampling distributions of the three AFMGQT ESS studies. A two-way ANOVA showed that the interaction of sampling distribution and criterion status was not significant [F (1,123) = 2.467, (p = .119)], nor was the main effect for sampling distribution [F (1,123) = 0.009, (p = .925)].

The combined decision accuracy level of these AFMGQT ESS studies, weighted for sample size and number of scorers, was .875 with a combined inconclusive rate of .170. Reliability for ESS scores of AFMGQT exams, reported by Nelson, Blalock and Handler (2011) as the bootstrap mean of pair-wise correlation coefficients, was .930.

Backster You-Phase

The Backster You-Phase technique is an event-specific diagnostic technique, based on Backster's Zone Comparison concept. This technique is scored using TDA rules developed by Cleve Backster and taught almost exclusively at the Backster School of Lie


Figure 2. Mean deceptive and truthful subtotal scores for AFMGQT ESS studies.

Detection (2011). Both generic ZCT (Department of Defense, 2006; Honts, Raskin & Kircher, 1987) and boutique ZCT variants (Gordon et al., 2000; Matte & Reuss, 1989) have been developed from the Backster You-Phase technique. Scores from two recent studies were aggregated to calculate the criterion accuracy profile of Backster You-Phase examinations.

Nelson (In press) used seeding parameters calculated from the composite distributions of two non-selected studies24 to seed a Monte Carlo space of 100 simulated Backster You-Phase exams. The Monte Carlo space consisted of 50 criterion truthful cases and 50 criterion deceptive cases. Unweighted decision accuracy for the Nelson (In press) Monte Carlo study of You-Phase exams was .927, with an inconclusive rate of .321.

Nelson, Handler, Adams and Backster (In press) surveyed the results of seven examiners who provided 144 blind numerical scores for a sample (N = 22) of 11 confirmed deceptive and 11 confirmed truthful You-Phase examinations. These seven scorers had a range of experience from less than one year to more than 30 years. Results from the blind-scores survey produced an unweighted decision accuracy rate of .825 with an inconclusive rate of .117. Figure 3 shows a mean and standard deviation plot for the scores of deceptive and truthful cases. A two-way unbalanced25 ANOVA showed that the interaction of sampling distribution and

24Honts, Hodes and Raskin (1985) used the Backster You-Phase technique in a countermeasure study for which the traditional arm cuff was replaced with an alternative cardio sensor. Meiron, Krapohl and Ashkenazi (2008) used the Backster You-Phase technique in a study of the Either-Or Rule, using a highly selected sample from which the results of problematic examinations were not included, resulting in a sample that is assumed to be systematically devoid of error variance. Although neither of these studies was usable alone, the parameters that describe the composite distributions of deceptive and truthful scores were assumed to be a more generalizable representation of error or uncontrolled variance along with diagnostic variance for scores from Backster You-Phase exams.

25 Unbalanced ANOVAs, using the harmonic mean of the sample Ns, were used throughout this study when necessitated by differences in sample sizes. As a result, the total degrees of freedom in the ANOVA summary may not reflect the sum of all samples in the same way as a balanced ANOVA design. Unbalanced ANOVA designs can be expected to provide slightly less statistical power than balanced ANOVA designs.
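A Monte Carlo space of the kind described in footnote 24 can be sketched by seeding a pseudo-random number generator with distribution parameters and applying cutscores. All parameter values below (means, standard deviations, cutscores) are hypothetical, and this illustrates only the general approach, not the published studies' actual procedure.

```python
import random

def monte_carlo_space(mean_d, sd_d, mean_t, sd_t, n=100, cut_lo=-6, cut_hi=6, seed=7):
    # Simulate n/2 deceptive and n/2 truthful total scores from seeded normal
    # distributions, then classify them with symmetric hypothetical cutscores.
    rng = random.Random(seed)
    deceptive = [rng.gauss(mean_d, sd_d) for _ in range(n // 2)]
    truthful = [rng.gauss(mean_t, sd_t) for _ in range(n // 2)]
    sensitivity = sum(s <= cut_lo for s in deceptive) / (n // 2)
    specificity = sum(s >= cut_hi for s in truthful) / (n // 2)
    inconclusive = sum(cut_lo < s < cut_hi for s in deceptive + truthful) / n
    return sensitivity, specificity, inconclusive
```

Repeating the simulation many times with resampled seeds would give the Monte Carlo standard errors and confidence intervals for the criterion accuracy profile described in the Data analysis section.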


Figure 3. Mean and standard deviation plot for Backster numerical scores of confirmed
You-Phase exams.

criterion status was not significant [F (1,68) = 0.869, (p = .355)], nor was the main effect for sampling distribution [F (1,68) = 0.164, (p = .686)].

The combined results of the two published studies of Backster You-Phase exams, weighted for the sample size and number of scorers, produced a decision accuracy rate of .863 and an inconclusive rate of .196. Reliability of Backster numerical scores of You-Phase exams, reported by Nelson et al. (In press) as a bootstrap mean of pairwise correlation coefficients, was .567.

Concealed Information Test (CIT)

The CIT, also known as a Guilty Knowledge Test (GKT; Lykken, 1959) and related to the Peak of Tension (POT) technique (Ansley, 1992), is an event-specific diagnostic technique that can be used to investigate whether an examinee possesses knowledge or information that would be available only to investigators and a guilty or involved suspect. Like the CQT, the CIT/GKT is based on the principle of salience, including emotion, cognition and behavioral conditioning as the psychological basis of physiological response (Senter, Weatherman, Krapohl & Horvath, 2010). Also, the CIT/GKT is conducted using instrumentation that is similar to CQT methods, including electrodermal sensors, and may include cardiograph and pneumograph sensors. However, the CIT/GKT does not include comparison questions and is not a CQT method. Therefore, the CIT/GKT has not been subject to the same ethical concerns as probable-lie CQT methods regarding manipulating the examinee as a feature of test administration.26 Hypothetical explanations for psychophysiological mechanisms underlying the CIT/GKT have not been limited to emotion and fear as the sole basis of response. Also, the CIT/GKT has remained free from scientific criticisms involving the role of

26Similarly, directed-lie methods have been shown to work as well as probable-lie methods, and are less subject to
the ethical complications regarding manipulating the test subject (Bell, Kircher & Bernhardt, 2008; Blalock, Nelson,
Handler & Shaw, 2011; Honts & Reavy, 2009).
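Descriptions of Lykken's (1959) GKT scoring commonly award 2 points when the key alternative draws the largest electrodermal response to a question and 1 point when it draws the second largest. The sketch below is a minimal illustration under those commonly reported conventions; the decision cutoff of one point per question and the data structures are assumptions for illustration, not a restatement of the original protocol.

```python
def gkt_score(questions, key="key"):
    # questions: one dict per CIT/GKT question, mapping each alternative's
    # label to its measured response magnitude; the critical item is `key`.
    score = 0
    for responses in questions:
        ranked = sorted(responses, key=responses.get, reverse=True)
        if ranked[0] == key:
            score += 2
        elif len(ranked) > 1 and ranked[1] == key:
            score += 1
    # Hypothetical decision rule: concealed knowledge indicated when the
    # total reaches one point per question.
    return score, score >= len(questions)
```

With two hypothetical questions where the key item draws the largest response once and the second largest once, the total is 3 points and knowledge is indicated; a key item that never ranks highly yields 0 points.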


examinee confessions as both an objective of the test and as a form of verification of the test results.27

MacLaren (2001) published the results of a meta-analysis of 50 samples in 22 studies involving the CIT/GKT. Thirty-nine of those samples involved 1,070 examinees, of which 666 had engaged in a behavioral act for which they had concealed information, along with 404 examinees who had no involvement or knowledge of the behavioral details. Eleven samples involved 177 examinees who possessed concealed knowledge of, but did not actually engage in, the behavioral act under investigation.

Using the scoring protocol and test methodology described by Lykken (1959), results reported by MacLaren produced a test sensitivity level of .815 for behaviorally involved examinees, along with a test specificity level of .832 for un-involved examinees who had no concealed knowledge. Unweighted decision accuracy was .823. However, when results included those examinees who possessed concealed information but were behaviorally un-involved, the test sensitivity level was .759, and the unweighted decision accuracy rate was .795.

Directed-lie Screening Test / Seven-position TDA

The Directed-lie Screening Test (DLST) is based on the Test for Espionage and Sabotage (TES) that was developed by the U.S. Department of Defense.28 As indicated by its name, the DLST technique uses directed-lie CQs and is designed for screening exams that are conducted in the absence of any known problem. Screening tests are often constructed and interpreted with multiple issues for which the criterion variance of the CQs is assumed to be independent. Two studies, involving the DLST/TES and seven-position TDA, have been published by the U.S. Department of Defense. Research reports were available, though insufficient to meet the study inclusion criteria, due to the absence of sampling standard deviations in the study reports. Also, data for the Department of Defense studies were not available for this committee to review. However, two DLST/TES replication studies were recently completed, and the data from those studies were included in this meta-analysis.

The first published study of the DLST/TES (Research Division Staff, 1995a) was a laboratory experiment involving a mock espionage scenario. Three experienced federally trained examiners scored 94 examinations, involving 26 programmed deceptive examinees and 68 programmed truthful examinees. Results from this study produced an unweighted decision accuracy level of .788, with an inconclusive rate of .155.29

In the second published study of the DLST/TES (Research Division Staff, 1995b), 10 experienced federally trained examiners provided scores for 30 deceptive and 55 truthful laboratory examinations involving a mock espionage scenario. Results from this study produced an unweighted decision

27 Overemphasis on confession confirmation and non-independent criterion has led to criticisms of contamination and overestimation of CQT accuracy.

28 The name Directed Lie Screening Test is used in contexts for which the investigation targets differ from espionage and sabotage.
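The "unweighted decision accuracy" statistic used throughout this survey can be sketched in a few lines, assuming (as the sensitivity and specificity figures above suggest) that it is the simple mean of the two rates, so that both criterion groups count equally regardless of group size:

```python
def unweighted_accuracy(sensitivity: float, specificity: float) -> float:
    """Simple mean of sensitivity (correct results among involved/deceptive
    cases) and specificity (correct results among un-involved/truthful
    cases); both criterion groups are weighted equally, regardless of how
    many cases each group contains."""
    return (sensitivity + specificity) / 2

# MacLaren's pooled results as reported above: sensitivity .815,
# specificity .832; close to the reported .823 (the committee's
# rounding convention is not stated).
print(round(unweighted_accuracy(0.815, 0.832), 3))
```

This is a sketch of the arithmetic only; the published studies report the underlying rates directly.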

29 Although all information was included in the published reports, some false-positive errors and inconclusive results were excluded from previously reported statistics for this study. False-positive results were removed from the reported results when the examinee made post-test admissions that would have been viewed as substantive in field settings, causing the results to be viewed as not erroneous. Inconclusive results were removed from the previously reported results because DLST/TES procedures require the immediate re-examination of inconclusive results. It was not possible for the blind scorers to repeat inconclusive examinations, and the investigators elected to describe DLST/TES accuracy without inconclusive results that could not be subject to re-examination. All false-positive and inconclusive results were included in the present results because this was considered a more conservative estimate of criterion accuracy for this technique.

Polygraph, 2011, 40(4) 218


Ad Hoc Committee on Validated Techniques

accuracy level of .888, with an inconclusive rate of .009.30

Nelson (In press) used Monte Carlo methods to study the criterion accuracy of the DLST/TES, and reported an unweighted decision accuracy level of .874, with an inconclusive rate of .096. The Monte Carlo space consisted of 50 criterion truthful cases and 50 criterion deceptive cases.

Nelson, Handler, Blalock and Hernández (In press) studied the DLST/TES in a mock espionage scenario with examiner trainees from the Iraqi government. Two scorers, including one experienced federally trained examiner who is also an APA primary instructor, and one international examiner who is an APA member from México, provided blind seven-position scores for 25 programmed deceptive and 24 programmed truthful examinees. Fifty scores were obtained for programmed deceptive examinees and 48 scores were obtained for programmed innocent examinees. The unweighted decision accuracy level was .831 with an inconclusive rate of .092.

Figure 4 shows a mean and standard deviation plot for the seven-position deceptive and truthful scores of DLST/TES exams. No mean and standard deviation data were available for the TES studies completed by the US Department of Defense. A two-way ANOVA showed that the interaction of sampling distribution and criterion status was not significant [F (1,128) = 0.109, (p = .742)], nor was the main effect for sampling distribution [F (1,128) = 0.023, (p = .880)].

The combined results of the four published studies of the DLST/TES with seven-position TDA, weighted for the sample size and the number of scorers, produced a decision accuracy rate of .844, and an inconclusive rate of .088. Reliability of DLST/TES studies completed by the U.S. Department of Defense, calculated via Cohen's Kappa, was reported as .760. The average proportion of pairwise decision agreement for all DLST/TES seven-position studies was .806.
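Combined figures in this survey are described as "weighted for the sample size and the number of scorers." A minimal sketch of that kind of pooling, assuming the weight for each study is its number of scored results (cases x scorers); the inputs below are illustrative, not the committee's exact worksheet:

```python
def combine_weighted(accuracies, n_cases, n_scorers):
    """Pool per-study accuracy rates, weighting each study by the number
    of scored results it contributed (sample size x number of scorers)."""
    weights = [c * s for c, s in zip(n_cases, n_scorers)]
    return sum(a * w for a, w in zip(accuracies, weights)) / sum(weights)

# Illustrative inputs only (hypothetical pairing of accuracies with
# study sizes; not the committee's actual computation):
pooled = combine_weighted([0.788, 0.888, 0.874],
                          n_cases=[94, 85, 100],
                          n_scorers=[3, 10, 1])
print(round(pooled, 3))
```

The design choice worth noting is that larger studies, and studies scored by more examiners, pull the pooled estimate toward their own results.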

Figure 4. Mean and standard deviation plot for seven-position scores of DLST/TES exams.

30 Two false-positive errors were removed from the previously reported accuracy estimations due to post-test admissions that were deemed by the principal investigator to have been likely to be considered substantive and not erroneous in field settings. One inconclusive result was removed from the previous accuracy estimations. Calculations in this report include all error and inconclusive results.


Directed-lie Screening Test / ESS

Four studies describe the criterion accuracy of the DLST/TES when scored with the ESS.

Nelson and Handler (In press) used Monte Carlo methods to study the criterion accuracy of DLST/TES exams scored with the ESS. The Monte Carlo space consisted of 50 criterion truthful cases and 50 criterion deceptive cases. Unweighted decision accuracy was reported as .831 with an inconclusive rate of .104.

Nelson, Handler and Morgan (In press) used a mock espionage scenario to study the criterion accuracy of ESS scores of DLST/TES exams conducted by seven inexperienced examiner trainees on eight non-naive examinees who were fully conversant with the operation and scoring of PDD examinations including the DLST/TES. Blind scores were obtained for 25 programmed deceptive exams and 24 programmed truthful exams. Unweighted decision accuracy was .854 with an inconclusive rate of .088.

Nelson (In press), using a different Monte Carlo method, compared ESS scores of DLST/TES examinations to seven-position and three-position scores. The Monte Carlo space consisted of 50 criterion truthful cases and 50 criterion deceptive cases. Unweighted decision accuracy of ESS scores was reported as .871, with an inconclusive rate of .048.

Nelson, Handler, Blalock and Hernández (In press) reported the results of blind ESS scores of DLST/TES examinations. Seven-position scores from two scorers, including one experienced federally trained examiner who is also an APA primary instructor, and one international examiner who is an APA member from México, were transformed to ESS scores, including 50 blind scores for 25 programmed guilty and 48 blind scores for 24 programmed innocent examinees. Unweighted decision accuracy was .859, and the unweighted inconclusive rate was .123.

Figure 5 shows a mean and standard deviation plot of the scores from the four DLST/TES ESS studies. A two-way ANOVA showed that the interaction of sampling distribution and criterion status was not significant [F (1,289) = 2.396, (p = .123)], nor was the main effect for sampling distribution [F (3,289) = 0.156, (p = .925)].

Figure 5. Mean and standard deviation plot for ESS scores of DLST/TES exams.

[Figure 5 chart data: mean deceptive scores of 4.660, 3.265, 3.636 and 2.086, and mean truthful scores of -2.442, -1.271, -3.031 and -1.781, across the Nelson & Handler; Nelson; Nelson, Handler & Morgan; and Nelson, Handler, Blalock & Hernández samples.]


The combined decision accuracy level of these studies, weighted for sample size and number of scorers, was .858 with a combined inconclusive rate of .090. Reliability of ESS scores of DLST/TES examinations, reported as the bootstrap mean of the proportion of the pairwise decision agreement, excluding inconclusive results, was .911 for the Nelson, Handler and Morgan (In press) study, and .769 for the Nelson, Handler, Blalock and Hernández (In press) study. The average rate of pairwise decision agreement for these studies was .840.

Federal You-Phase / Seven-position TDA

The Federal You-Phase technique31 is an event-specific diagnostic technique constructed with two RQs. Two studies describe the criterion accuracy of the Federal You-Phase technique when scored using the seven-position TDA model.

Nelson (2011) used Monte Carlo methods to calculate the criterion accuracy of Federal You-Phase exams when scored with the seven-position TDA model. The Monte Carlo space consisted of 50 criterion truthful cases and 50 criterion deceptive cases. Unweighted decision accuracy was .870, and the unweighted inconclusive rate was .301.

Nelson, Handler, Blalock and Cushman (In press) obtained blind scores from eight inexperienced and two experienced scorers who used the seven-position TDA model to provide 220 scores for a sample of Federal You-Phase examinations (N = 22) selected from the confirmed case archive of the U.S. Department of Defense. Eleven of the cases were confirmed as deceptive, and 11 were confirmed as truthful. Unweighted decision accuracy was .885, with an unweighted inconclusive rate of .108.

Figure 6 shows a mean and standard deviation plot of the scores from the Federal You-Phase seven-position criterion studies. A two-way ANOVA showed that the interaction of sampling distribution and criterion status was not significant [F (1,68) = 0.628, (p = .431)], nor was the main effect for sampling distribution [F (1,68) = 0.001, (p = .977)].

Figure 6. Mean and standard deviation plot for seven-position scores of Federal You-Phase exams.
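Reliability in this report is repeatedly summarized as the bootstrap mean of the proportion of pairwise decision agreement, excluding inconclusive results. A sketch of one way such a statistic can be computed; the decision coding and resampling scheme here are assumptions for illustration, not the committee's exact procedure:

```python
import random
from itertools import combinations

def pairwise_agreement(decisions):
    """Mean proportion of pairwise decision agreement across scorers.
    `decisions` is a list of per-scorer decision lists; None marks an
    inconclusive result and is excluded from each pairwise count."""
    rates = []
    for a, b in combinations(decisions, 2):
        kept = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if kept:
            rates.append(sum(x == y for x, y in kept) / len(kept))
    return sum(rates) / len(rates)

def bootstrap_mean_agreement(decisions, n_boot=1000, seed=42):
    """Resample the cases with replacement and average the pairwise
    agreement across resamples (a basic case-level bootstrap)."""
    rng = random.Random(seed)
    n = len(decisions[0])
    means = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        means.append(pairwise_agreement([[s[i] for i in idx] for s in decisions]))
    return sum(means) / n_boot

# Hypothetical decisions ('DI'/'NDI'/None) from three scorers on five cases
s1 = ['DI', 'NDI', 'DI', None, 'NDI']
s2 = ['DI', 'NDI', 'NDI', 'DI', 'NDI']
s3 = ['DI', 'DI', 'DI', 'DI', None]
print(round(pairwise_agreement([s1, s2, s3]), 3))
```

The bootstrap mean will track the plain pairwise agreement closely; its value is in attaching a resampling-based sense of stability to the point estimate.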

31 Sometimes referred to as the Bi-Zone technique.


The combined decision accuracy level of these studies of Federal You-Phase exams scored with seven-position scoring, weighted for sample size and number of scorers, was .883 with a combined inconclusive rate of .168. Reliability of seven-position scores of Federal You-Phase exams, reported as the bootstrap mean of the proportion of the pairwise decision agreement, excluding inconclusive results, was .852.

Federal You-Phase / ESS

Two studies describe the criterion accuracy of Federal You-Phase examinations when scored with the ESS.

Nelson (2011) used Monte Carlo methods to calculate the criterion accuracy of Federal You-Phase exams when scored with the ESS. The Monte Carlo space consisted of 50 criterion truthful cases and 50 criterion deceptive cases. Unweighted decision accuracy was .897 and the unweighted inconclusive rate was .096.

Nelson, Handler, Blalock and Cushman (In press) reported the criterion accuracy of Federal You-Phase exams using ESS scores that were obtained by transforming 220 blind seven-position scores obtained from eight inexperienced and two experienced scorers who evaluated a sample of Federal You-Phase examinations (N = 22) selected from the confirmed case archive of the U.S. Department of Defense. Eleven of the cases were confirmed as deceptive, and 11 were confirmed as truthful. Unweighted decision accuracy was .906, and the unweighted inconclusive rate was .235.

Figure 7 shows a mean and standard deviation plot of the ESS scores of the Federal You-Phase studies. A two-way ANOVA showed that the interaction of sampling distribution and criterion status was not significant [F (1,68) = 0.155, (p = .695)], nor was the main effect for sampling distribution [F (1,68) = 0.021, (p = .886)].

The combined decision accuracy level of these studies of Federal You-Phase exams, weighted for sample size and number of scorers, was .904 with a combined inconclusive rate of .192, when scored with the ESS TDA method. Reliability, reported as the bootstrap proportion of pair-wise decision agreement excluding inconclusive results, was .897.

Figure 7. Mean deceptive and truthful scores for Federal You-Phase / ESS studies.
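Several of the studies above estimate accuracy from a Monte Carlo space of 50 criterion truthful and 50 criterion deceptive cases. The general approach can be sketched as follows; the score distributions and cutscores below are invented for illustration and are not the parameters used in the cited studies:

```python
import random

def monte_carlo_accuracy(n_cases=50, cut_di=-3, cut_ndi=3, n_iter=2000, seed=1):
    """Simulate total scores for criterion deceptive and truthful cases,
    classify them against fixed cutscores, and estimate decision accuracy
    (among decided cases) and the inconclusive rate. The normal score
    distributions (means +/-6, sd 4) are hypothetical."""
    rng = random.Random(seed)
    correct = inconclusive = total = 0
    for _ in range(n_iter):
        for mean, is_deceptive in ((-6, True), (6, False)):
            for _ in range(n_cases):
                score = rng.gauss(mean, 4)
                total += 1
                if score <= cut_di:          # classified deceptive
                    correct += is_deceptive
                elif score >= cut_ndi:       # classified truthful
                    correct += not is_deceptive
                else:                        # between cutscores
                    inconclusive += 1
    decided = total - inconclusive
    return correct / decided, inconclusive / total

acc, inc = monte_carlo_accuracy()
print(round(acc, 3), round(inc, 3))
```

With wider cutscores the inconclusive rate rises and decision accuracy among the remaining cases improves, which is the trade-off discussed throughout this section.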


Federal ZCT / Seven-position TDA

The Federal ZCT is an event-specific diagnostic technique constructed with three RQs. Three studies describe the criterion accuracy of the Federal ZCT when scored using the seven-position TDA model.

Blackwell (1998) described the criterion accuracy of 100 confirmed Federal ZCT exams, of which 65 examinations were confirmed as deceptive while 35 exams were confirmed as truthful. A total of 195 scored results were obtained for criterion deceptive exams, and 105 scored results were obtained for criterion truthful exams. Three experienced federally trained examiners scored all of the cases using the seven-position TDA method.32 Unweighted decision accuracy was .793, and the unweighted inconclusive rate was .159.

Krapohl and Cushman (2006) reported the criterion accuracy of Federal ZCT exams scored by a cohort of 10 experienced examiners, each of whom scored 50 confirmed deceptive field examinations and 50 confirmed truthful exams selected from the U.S. Department of Defense confirmed case archive. A total of 1,000 scored results were obtained. Unweighted decision accuracy was .852, and the unweighted inconclusive rate was .198.

Honts, Amato and Gordon (2004) reported the criterion accuracy of Federal ZCT exams that were evaluated by three scorers who used the federal seven-position TDA model. A total of 72 scores were obtained for 24 criterion deceptive exams, and 72 scores were obtained for 24 criterion truthful exams. Unweighted decision accuracy was .958, and the unweighted inconclusive rate was .042.

Figure 8 shows a mean and standard deviation plot of the seven-position scores of the Federal ZCT studies. A two-way ANOVA showed that the interaction of sampling distribution and criterion status was not significant [F (1,204) = 0.706, (p = .402)], nor was the main effect for sampling distribution [F (1,204) = 0.004, (p = .951)].

Figure 8. Mean deceptive and truthful scores for Federal ZCT exams with
seven-position TDA.

32 The older, pre-2006, Federal TDA model employed more features than the presently used evidence-based Federal TDA model. However, Kircher et al. (2005) reported that experienced examiners tend to violate rules that do not work and tend to emphasize procedures that do work. Therefore it is possible that the scores of these examiners reflect current training and field practices more closely than might be initially assumed.


The combined decision accuracy level of these seven-position TDA studies of Federal ZCT exams, weighted for sample size and number of scorers, was .860 with a combined inconclusive rate of .171. Reliability of seven-position scores of Federal ZCT exams, reported as the Fleiss' kappa statistic for categorical decisions of multiple raters, was .570, and the pairwise proportion of decision agreement excluding inconclusive results was .800.

Federal ZCT / Seven-position TDA with evidentiary decision rules

Two studies describe the criterion accuracy of the Federal ZCT when scored using the seven-position TDA model and evidentiary decision rules.

Krapohl and Cushman (2006) described the criterion accuracy of 100 Federal ZCT exams that were scored by 10 experienced examiners using the seven-position TDA model and evidentiary decision rules.33 A total of 1,000 scored results were obtained. Examinations were selected from the U.S. Department of Defense confirmed case archive. Fifty of the examinations were confirmed as deceptive, and 50 of the exams were confirmed as truthful. Unweighted decision accuracy was .872, and the unweighted inconclusive rate was .073.

Nelson and Krapohl (2011) reported the criterion accuracy of 60 Federal ZCT exams that were evaluated by six experienced federally trained scorers. Thirty of the examinations were confirmed as deceptive and 30 exams were confirmed as truthful. Each scorer evaluated a random subset of 10 exams. Results were evaluated using the Federal seven-position TDA model and evidentiary decision rules. Unweighted decision accuracy was .870, and the unweighted inconclusive rate was .100.

Figure 9 shows a mean and standard deviation plot of the seven-position scores of the Federal ZCT studies, using evidentiary rules. A two-way ANOVA showed that the interaction of sampling distribution and criterion status was not significant [F (1,146) = 0.001, (p = .981)], nor was the main effect for sampling distribution [F (1,146) = 0.046, (p = .830)].

Figure 9. Mean deceptive and truthful scores for Federal ZCT exams with seven-position
TDA and evidentiary decision rules.
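Interrater reliability for multi-rater categorical decisions is summarized above with Fleiss' kappa. A compact sketch of the standard computation; the cases, raters, and category counts below are hypothetical:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for categorical decisions of multiple raters.
    counts[i][j] = number of raters assigning case i to category j;
    every case must be rated by the same number of raters."""
    n = len(counts)                 # number of cases
    r = sum(counts[0])              # raters per case
    k = len(counts[0])              # number of categories
    # Expected chance agreement from the marginal category proportions
    p_j = [sum(row[j] for row in counts) / (n * r) for j in range(k)]
    p_e = sum(p * p for p in p_j)
    # Observed agreement, averaged over cases
    p_bar = sum((sum(c * c for c in row) - r) / (r * (r - 1))
                for row in counts) / n
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical: 4 cases, 5 raters, categories [DI, NDI, INC]
counts = [[5, 0, 0], [4, 1, 0], [1, 4, 0], [0, 3, 2]]
print(round(fleiss_kappa(counts), 3))  # about .397 for these data
```

Unlike the raw proportion of pairwise agreement also reported above, kappa discounts the agreement expected by chance given the marginal decision rates.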

33 Krapohl (2005) and Krapohl and Cushman (2006) showed that evidentiary decision rules can substantially reduce inconclusive rates without a corresponding loss of overall decision accuracy.


The combined decision accuracy level of these studies of Federal ZCT exams with the seven-position TDA method and evidentiary decision rules, weighted for sample size and number of scorers, was .872 with a combined inconclusive rate of .075. Reliability, calculated as the bootstrap average of pairwise decision agreement excluding inconclusive results, was .870.

Integrated Zone Comparison Technique

The Integrated Zone Comparison Technique (IZCT) (Gordon et al., 2000) is a proprietary event-specific diagnostic technique scored with the Horizontal Scoring System (Gordon, 1999).34 Three studies describe the criterion accuracy of this technique.

Gordon et al. (2005; also described in Mohamed et al., 2006) reported the results of a pilot study involving six guilty and five innocent subjects35 who participated in a laboratory scenario involving a mock shooting incident. Decision accuracy was reported as 1.000, with an unweighted inconclusive rate of .100.

Shurani and Chaves (2010) reported the results of a survey of 84 field examinations conducted with the IZCT, including 44 scores for confirmed deceptive examinees and 40 scores for confirmed truthful examinees. All examinations were reportedly verified by confessions, with extrapolygraphic evidence extant for some exams. Unweighted decision accuracy was .988, with an unweighted inconclusive rate of .061. No reliability statistics were reported for this study, and the committee was unable to calculate interrater reliability from the available data.

Shurani (2011) reported the results of a field study involving three examiners from Costa Rica who used the IZCT along with an additional experimental technique. The sample consisted of 73 cases for which all possible suspects were tested. Forty-eight cases were confirmed, resulting in N = 188 examinations that were conducted using the IZCT with three and four RQs.36 Two inconclusive results were removed from the reported results. No information was reported regarding the number of exams conducted with three or four RQs. However, data were provided to the committee for 84 examinations reportedly conducted using three RQs, including scores for 36 deceptive cases and 48 truthful cases. Scores for the remaining 104 exams were not made available. No sampling means or standard deviations were reported, and the committee was unable to compare the means of the sample data provided to the committee with any published information. Results of this study were reported with perfect accuracy and zero inconclusive findings. No reliability statistics were reported for this study, and the committee was unable to calculate interrater reliability from the available data.

Figure 10 shows a mean and standard deviation plot of the sampling distributions of the IZCT studies. A two-way ANOVA showed a significant interaction between the sampling distribution and case status [F (1,173) = 533.771, (p < .001)]. Post-hoc one-way ANOVAs showed that the sampling differences for deceptive cases were not significant. However, the difference in truthful scores for the three samples was significant [F (2,33) = 21.402, (p = .014)]. Truthful scores were significantly greater for the Shurani and Chaves (2010) and Shurani (2011) studies compared to the Gordon et al. (2005) study.
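The post-hoc comparisons reported throughout this document are one-way ANOVAs across sampling distributions. The F statistic they report can be sketched in a few lines; the group data below are invented for illustration:

```python
def one_way_anova_f(groups):
    """One-way ANOVA F statistic for a list of score groups:
    F = (between-group SS / df_between) / (within-group SS / df_within)."""
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Invented truthful-score samples from three hypothetical studies
f = one_way_anova_f([[4, 5, 6, 5], [7, 8, 6, 7], [5, 6, 5, 4]])
print(f)  # F = 8.0 for these invented data
```

A large F (relative to its degrees of freedom) indicates that the study samples differ more between groups than the within-group spread would predict, which is what the significant truthful-score comparisons above are reporting.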

34 A rank order scoring system based on unique developer-devised measurement features.

35 The original pilot study design included six innocent subjects; however, one truthful subject made a false confession to the examiner (Gordon, personal communication, July 6, 2011), who was also the primary author of the Gordon et al. (2005) study and developer of the IZCT. Inclusion of the false-positive (false-confession) error case would have resulted in less than perfect accuracy.

36 No published description exists for the use of the IZCT with four RQs. Because the IZCT is scored using a rank order paradigm, inclusion of additional RQs without the inclusion of an equivalent number of additional CQs can be expected to differentially affect the rank-sum scores of relevant and comparison questions. No published studies have described or investigated these statistical complexities.


It was later learned that the Gordon et al. (2005) study was conducted using single-issue IZCT exams while the Shurani and Chaves (2010) sample cases were conducted using multi-facet IZCT exams. It is not clear whether this difference accounts for the significant interaction and differences observed in these sampling distributions.37 A two-way ANOVA comparison, scores x sampling distribution, of the sampling distributions from the Shurani and Chaves (2010) and Shurani (2011) samples revealed a significant interaction [F (1,164) = 43.140, (p < .001)], suggesting that the scores of deceptive and truthful cases were expressed or interpreted differently in the Shurani and Chaves (2010) and Shurani (2011) study samples. One-way differences were not significant. Scores for the Shurani (2011) study were further from zero than the scores for the Shurani and Chaves (2010) study for both deceptive and truthful cases.

Figure 10. Mean deceptive and truthful scores for IZCT samples.

The combined decision accuracy level of these IZCT studies, weighted for sample size and number of scorers, was .994 with a combined inconclusive rate of .033.

No reliability statistics were reported for any of the IZCT studies, and the committee was unable to calculate interrater reliability from the available data.

Matte Quadri-track Zone Comparison Technique

The Matte Quadri-track Zone Comparison Technique (MQTZCT) (Matte & Reuss, 1989) is a proprietary event-specific, single-issue diagnostic technique scored using a modification of the Backster numerical system. Three studies describe the criterion accuracy of the MQTZCT.

37 In a practical sense, differences in assumptions about independence and non-independence among the test questions will result in the use of different decision rules, and these differences may have had a biasing effect on case confirmation and sample selection for these field studies. Rank order scores for all RQs are always relative to all other relevant and comparison test stimuli. Rank order scores are therefore inherently non-independent, and the mathematical justification for the application of a rank-order scoring model to multi-facet exams, for which the decision rules are based on the assumption of independence, is not clear. This non-trivial statistical and decision theoretical complication has not been adequately discussed or studied.


Matte and Reuss (1989) reported the results of 64 deceptive and 58 truthful cases that were confirmed through combinations of confession and other evidence. Unweighted decision accuracy was reported as a perfect 1.000, with an unweighted inconclusive rate of .059.

Mangan, Armitage and Adams (2008) reported the criterion accuracy of a survey of 91 deceptive cases and 45 truthful cases that were confirmed via examinee confession. Decision accuracy was again reported as a perfect 1.000, with an unweighted inconclusive rate of .011.

Shurani, Stein and Brand (2009) reported the criterion accuracy of a survey of 28 deceptive and 29 truthful cases that were confirmed by confession along with additional evidence for some cases. Decision accuracy was reported as .964, with zero inconclusive results.

Figure 11 shows a mean and standard deviation plot of the subtotal scores38 of the sampling distributions of the three MQTZCT studies.39 A two-way ANOVA revealed a significant interaction between the sampling distribution and case status [F (1,261) = 361.605, (p < .001)]. Although the different studies appeared to handle deceptive and truthful cases with different effectiveness, post-hoc one-way ANOVAs showed that the differences in scores were not significant for deceptive cases [F (2,141) = 0.389, (p = .678)] or for truthful cases [F (2,122) = 0.264, (p = .768)].

Figure 11. Mean deceptive and truthful per-chart scores for MQTZCT samples.

38 Scores for MQTZCT exams are reported as the subtotal per chart, obtained by summing all numerical scores within each chart. Subtotals described elsewhere in this report involve the between-chart RQ subtotals, obtained by summing the numerical scores for each RQ for all charts.

39 Data initially provided to the ad hoc committee for the Mangan, Armitage and Adams (2008) and Shurani, Stein and Brand (2009) studies included only those scores for which the scorers achieved the correct result, and did not include scores for inconclusive or erroneous results. Missing scores were later provided to the committee for both the Mangan, Armitage and Adams (2008) and Shurani, Stein and Brand (2009) studies. However, the resulting sampling distributions were different from those reported for both studies. Because of these troublesome discrepancies, the statistical analysis was not re-calculated with the missing scores, and the reported analysis reflects the mean scores as reported by Mangan, Armitage and Adams (2008) and Shurani, Stein and Brand (2009). The result of this confound is that the sampling distributions, as reported, should be considered systematically devoid of error or unexplained variance, and therefore not generalizable.


The combined decision accuracy level of these MQTZCT studies, weighted for sample size and number of scorers, was .994 with a combined inconclusive rate of .029. Reliability for MQTZCT exams was reported by Matte and Reuss (1989) as .990.40

Utah ZCT – Probable Lie Test

The Utah ZCT Probable Lie Test,41 also referred to as the Utah Probable Lie Test (PLT) (Handler, 2006; Handler & Nelson, 2008), and the Utah numerical scoring system (Bell et al., 1999; Handler & Nelson, 2008) were developed by researchers at the University of Utah as a modification of the Backster ZCT (Backster, 1963). Two studies describe the criterion accuracy of the Utah PLT.

Honts, Raskin and Kircher (1987) reported the results of 10 programmed deceptive and 10 programmed truthful examinees in a study of polygraph countermeasures.42 Unweighted decision accuracy of blind numerical scores was .889, with an inconclusive rate of .150.43

Kircher and Raskin (1988) reported the results from two scorers, both of whom scored 50 programmed deceptive and 50 programmed truthful examinees in a laboratory study. A total of 200 scored results were obtained. Unweighted decision accuracy of blind numerical scores was .935, with an inconclusive rate of .070.

Figure 12 shows a mean and standard deviation plot of the scores of the sampling distributions of the included Utah PLT studies. A two-way ANOVA showed that the interaction of sampling distribution and criterion status was not significant [F (1,63) = 1.682, (p = .200)], nor was the main effect for sampling distribution [F (1,63) = 0.108, (p = .743)].

The combined decision accuracy level of these Utah PLT studies, weighted for sample size and number of scorers, was .931 with a combined inconclusive rate of .077. Reliability for Utah PLT exams, expressed as the average of kappa statistics for the two studies, was .730, with a pairwise rate of overall decision agreement, excluding inconclusive results, of .975.

40 This statistic was published in the Matte and Reuss (1989) reprint of the dissertation published in the journal Polygraph, but cannot be located in the original dissertation study for the no longer extant Columbia Pacific University.

41 Developers of the Utah technique appear to have given little concern to the name of the test question format, and this format has also been referred to as the Utah 3-question version and the Utah PLT. Polygraph field examiners have used the term Utah ZCT because of the obvious similarities with other ZCT variants. The term Utah ZCT is used in this document to aid in the recognition of the procedural and practical similarities between this technique and other three-question ZCT formats intended for single-issue event-specific testing.

42 Only the non-countermeasure control group cases are included in this analysis.

43 Honts, Raskin and Kircher (1987) reported mean scores but were not required by editorial and publication standards at the time of publication to report standard deviations for the sampling distributions of truthful and deceptive scores. Because data were no longer available to calculate these missing statistics, a blunt estimate of the pooled standard deviation was calculated from the reported t-value for the level of significance of the difference between truthful and deceptive scores.
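Back-computing a pooled standard deviation from a reported t value, as described in footnote 43, amounts to inverting the usual two-sample t formula. A sketch under that assumption; the numbers in the example are hypothetical, not those of the cited study:

```python
import math

def pooled_sd_from_t(mean_diff, t_value, n1, n2):
    """Invert the two-sample t statistic,
        t = (m1 - m2) / (sp * sqrt(1/n1 + 1/n2)),
    to recover a blunt estimate of the pooled standard deviation sp
    from a reported mean difference, t value, and group sizes."""
    return mean_diff / (t_value * math.sqrt(1 / n1 + 1 / n2))

# Hypothetical values: mean score difference 12.0, t = 4.0, n = 10 per group
print(round(pooled_sd_from_t(12.0, 4.0, 10, 10), 3))
```

This recovers only a single pooled spread for both groups, which is why the committee describes it as a blunt estimate rather than the per-group standard deviations the inclusion criteria called for.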


Figure 12. Mean deceptive and truthful total scores for Utah PLT studies.

Utah ZCT – Directed Lie Test

The Utah ZCT Directed Lie Test (DLT) is a variant of the Utah PLT, using directed-lie CQs in place of probable-lie questions. Two studies describe the criterion accuracy of Utah DLT exams.

Honts and Raskin (1988) reported the criterion accuracy of Utah DLT exams of 25 criminal suspects, including 12 deceptive and 13 truthful persons, whose examinations were later confirmed by confession, evidence, the confession of an alternative suspect, or the retraction of an allegation. Unweighted decision accuracy of blind numerical scores was .958, with an inconclusive rate of .077.44

Horowitz, Kircher, Honts and Raskin (1997) reported the results of 15 programmed deceptive and 15 programmed truthful examinees who participated in a laboratory experiment. Unweighted decision accuracy of blind numerical scores was .856, with an inconclusive rate of .067.45

Figure 13 shows a mean and standard deviation plot of the scores of the sampling distributions of the included Utah DLT studies. A two-way ANOVA showed that the interaction of sampling distribution and criterion status was not significant [F (4,51) = 0.705, (p = .592)], nor was the main effect for sampling distribution [F (2,51) = 0.009, (p = .991)].

The combined decision accuracy level of these Utah DLT studies, weighted for sample size and number of scorers, was .902 with a combined inconclusive rate of .073. Reliability for Utah DLT exams, expressed as the average of Pearson correlation coefficients for the included studies, was .930.

44 Honts and Raskin (1988) reported mean scores but were not required by editorial and publication standards at the time of publication to report standard deviations for the sampling distributions of truthful and deceptive scores. Because data were no longer available to calculate these missing statistics, a blunt estimate of the pooled standard deviation was calculated from the reported F-ratio for the level of significance of the difference between truthful and deceptive scores.

45 Mean and standard deviation statistics were measured to the nearest 1/2 point from Figure 1 in Horowitz,
Kircher, Honts and Raskin (1997) study report.


Figure 13. Mean deceptive and truthful total scores for Utah DLT studies.

Utah ZCT – Canadian Police College / Royal Canadian Mounted Police Version

The Canadian Police College (CPC) and the Royal Canadian Mounted Police (RCMP) have developed a variant of the Utah PLT, referred to as the RCMP Zone or the CPC Series A exam. Three studies describe the criterion accuracy of the Utah RCMP Series A exam.

Honts, Hodes and Raskin (1985) reported the criterion accuracy of Utah PLT exams using the test question sequence of the RCMP Series A exam, including 19 deceptive and 19 truthful cases. Unweighted decision accuracy of blind numerical scores was .833, with an inconclusive rate of .237.

Driscoll, Honts and Jones (1987) reported the criterion accuracy of Utah PLT exams, using the results of 20 programmed deceptive and 20 programmed truthful examinees who were recruited from a group counseling program at a Veterans Center, using the test question sequence of the RCMP Series A exam. Decision accuracy was reported as 1.000, with an unweighted inconclusive rate of .100.

Honts (1996) reported the results of a survey of criterion accuracy of field examinations conducted by Canadian law enforcement officers using the RCMP version of the Utah PLT. Twenty-one of the cases were confirmed as deceptive, and 11 of the cases were confirmed as truthful. Unweighted decision accuracy of blind numerical scores was .969, with an inconclusive rate of .210.

Figure 14 shows a mean and standard deviation plot of the scores of the sampling distributions of the included Utah CPC-RCMP studies. A two-way ANOVA showed that neither the interaction of sampling distributions and criterion status [F (1,99) = 0.562, (p = .455)] nor the main effect for sampling distribution [F (1,99) = 0.109, (p = .742)] was statistically significant.

The combined decision accuracy level of these Utah CPC-RCMP studies, weighted for sample size and number of scorers, was .939 with a combined inconclusive rate of .183. Reliability for Utah RCMP exams was reported by Honts (1996) as Kappa = .480 for categorical decision agreement adjusted for chance agreement. The average pairwise Pearson correlation coefficient for numerical scores of the included studies was .940, and the average proportion of decision agreement, excluding inconclusive results, was .883.



Ad Hoc Committee on Validated Techniques

Figure 14. Mean deceptive and truthful total scores for Utah CPC-RCMP studies.
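The combined accuracy levels above are weighted for sample size and number of scorers. A minimal sketch of such a weighted mean; the study tuples below are illustrative placeholders, not the exact study weights used by the committee:

```python
def weighted_accuracy(studies):
    """Weighted mean accuracy; studies are (accuracy, n_cases, n_scorers) tuples."""
    weights = [n * scorers for _, n, scorers in studies]
    return sum(acc * w for (acc, _, _), w in zip(studies, weights)) / sum(weights)

# Illustrative only: three studies, each weighted by cases x scorers.
combined = weighted_accuracy([(.833, 38, 1), (1.000, 40, 1), (.969, 32, 1)])
```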

Utah ZCT – Combined PLT, DLT and RCMP Studies

Figure 15 shows a mean and standard deviation plot for the three variants of the Utah ZCT. A two-way ANOVA showed that the interaction between the test variant and criterion status was not significant [F (1,246) = 2.553, (p = .111)], nor was the main effect for sampling distribution [F (2,246) = 0.02, (p = .980)]. Because the interaction was approaching a significant level, one-way post-hoc ANOVAs were also completed. Differences between the sampling distributions were not significant for the deceptive scores [F (2,100) = 0.042, (p = .959)] or for the truthful scores [F (2,100) = 0.008, (p = .992)].

Figure 15. Mean deceptive and truthful total scores for three variants of the Utah ZCT.
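The post-hoc one-way ANOVAs above compare between-group variance against within-group variance. A self-contained sketch of the F statistic for k groups of scores:

```python
def one_way_f(groups):
    """One-way ANOVA F statistic for a list of numeric groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n - k))

f = one_way_f([[1, 2, 3], [2, 3, 4]])  # F = 1.5 for this toy data
```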


Unweighted decision accuracy for seven included studies pertaining to the three variants of the Utah technique, weighted for sample size and number of scorers, was .930, with an unweighted inconclusive rate of .107. Reliability statistics were averaged for all included Utah ZCT studies, and produced an average reliability statistic of kappa = .647. The average rate of decision agreement excluding inconclusive results was .958, and the average Pearson correlation coefficient for numerical scores was .913.

Event-Specific ZCT / ESS

The ESS is an evidence-based TDA model that includes normative data for ZCT examinations and other PDD techniques. Because ESS transformations are non-parametric, ESS scores are sensitive to differences in response magnitude yet robust against differences in the linearity of response magnitude.

Nelson et al. (2011) reported a summary of five previous criterion accuracy studies of ESS scores of ZCT examinations, including results reported by Nelson, Krapohl and Handler (2008), Blalock, Cushman and Nelson (2009), Nelson, Blalock, Oelrich and Cushman (2011), Handler, Nelson, Goodson and Hicks (2010) and Nelson and Krapohl (2011). These studies included 5,192 scored results from 140 scorers who evaluated 732 individual examinations. Those results consisted of 2,671 scored results of 384 confirmed deceptive examinations, and 2,521 scored results of 348 confirmed truthful exams. Examinations included both Federal ZCT and Utah ZCT exams. Unweighted decision accuracy of these scores, excluding inconclusive results, was .921, and the unweighted inconclusive rate was .098.

Nelson, Krapohl and Handler (2008) reported the criterion accuracy of ESS scores of seven inexperienced examiner trainees who used the ESS to evaluate a sample of 100 exams selected from the U.S. Department of Defense confirmed case archive. Fifty of the examinations were confirmed as deceptive, and 50 of the exams were confirmed as truthful. A total of 700 scored results were obtained. Unweighted decision accuracy of blind numerical scores was .872, with an inconclusive rate of .103.

Blalock, Cushman and Nelson (2009), in a replication study, reported the criterion accuracy of a group of nine examiner trainees who used the ESS to evaluate a sample of 100 exams selected from the U.S. Department of Defense confirmed case archive. Fifty of the examinations were confirmed as deceptive, and 50 of the exams were confirmed as truthful. A total of 900 scored results were obtained. Unweighted decision accuracy of blind numerical scores was .870, with an inconclusive rate of .138.

Nelson, Blalock, Oelrich and Cushman (2011) reported the results of a reliability study involving 25 experienced examiners who used the ESS to evaluate a sample of 10 examinations selected from the U.S. Department of Defense confirmed case archive. Six of the cases were confirmed as deceptive, and four cases were confirmed as truthful. A total of 250 scored results were obtained. The pairwise proportion of decision agreement was .950, and the unweighted average of correct decisions excluding inconclusive results was .958. The unweighted inconclusive rate was .102.

Handler, Nelson, Goodson and Hicks (2010) reported the criterion accuracy of 19 examiner trainees from the México Policía Federal, who used the ESS to evaluate 100 examinations selected from the U.S. Department of Defense confirmed case archive. Fifty of the examinations were confirmed as deceptive, and 50 of the exams were confirmed as truthful. A total of 1,900 scored results were obtained. Unweighted decision accuracy of blind numerical scores was .901, with an inconclusive rate of .040.

Nelson and Krapohl (2011) reported the criterion accuracy of transformed ESS scores from six experienced federally trained examiners who evaluated a sample of 60 examinations selected from the U.S. Department of Defense confirmed case archive. Each examiner scored 10 cases. Thirty of the examinations were confirmed as deceptive, and 30 of the exams were confirmed as truthful. Unweighted decision accuracy of blind numerical scores was .913, with an unweighted inconclusive rate of .020.

Remaining scores of the Nelson et al. (2011) results consisted of 1,382 scored


results of 572 individual examinations. These results consisted of 741 scored results of 304 confirmed deceptive examinations, and 641 scored results of 268 confirmed truthful exams. Data from the Krapohl and Cushman (2006) study were scored using an automated version of the ESS, including 50 confirmed deceptive examinations and 50 confirmed truthful exams selected from the U.S. Department of Defense confirmed case archive. These exams were also scored by a cohort of 11 examiner trainees from the Colombian Army Counterintelligence Unit, who used the ESS in pairs of two and three examiners to score 10 cases each. Data from the holdout sample used by Krapohl and McManus (1999) were also scored using an automated version of the ESS, including 30 confirmed deceptive examinations and 30 confirmed truthful exams. This holdout sample was also evaluated by a cohort of 35 scorers from Romania, consisting of 15 international polygraph examiners and 20 researchers, psychologists and graduate students, from the University of Iasi in Romania, who used the ESS while working in teams to score subsets of 10 cases each. The holdout sample was also scored by a cohort of 12 examiner trainees from the Panama National Police who worked in teams to score subsets of 10 cases each. In addition, seven examiner trainees from police agencies in the state of Ohio used the ESS to score subsets of 10 cases each from the holdout sample. One subset of 10 cases was scored by two of the Ohio police trainees.

Numerical scores from the Kircher, Kristjiansson, Gardner and Webb (2005) study (N = 80) were transformed to ESS scores, including 40 scores of confirmed deceptive exams and 40 scores of confirmed truthful exams. Data from the OSS development sample (Krapohl, 2002; Krapohl & McManus, 1999; Nelson, Krapohl & Handler, 2008) were evaluated using an automated model of the ESS, including 149 scores for confirmed deceptive exams and 143 scores for confirmed truthful exams. Seven-position scores from two experts who participated in the Kircher and Raskin (1988) study were transformed to ESS scores, including 100 scores for 50 examinations of programmed deceptive examinees, and 100 scores for 50 exams conducted on programmed truthful examinees. Finally, seven-position scores from three expert scorers who evaluated the cases for the Blackwell (1998) study were transformed to ESS scores, including 195 scores for 65 confirmed deceptive examinations, and 105 scores for 35 confirmed truthful exams.

Figure 16 shows a mean and standard deviation plot of the scores of the sampling distributions of the included ZCT ESS studies. A two-way ANOVA showed that the interaction

Figure 16. Mean deceptive and truthful total scores for ZCT ESS studies.


of sampling distribution and criterion status was not significant [F (1,215) = 0.205, (p = .651)], nor was the main effect for sampling distribution [F (5,215) = 0.164, (p = .976)].

The combined decision accuracy level of the included ZCT ESS studies, weighted for sample size and number of scorers, was .922 with a combined inconclusive rate of .098, as reported by Nelson et al. (2011). Reliability statistics for ZCT ESS studies were averaged to produce a pairwise proportion of decision agreement, excluding inconclusive results, of .950, with average Kappa = .585.

Criterion Accuracy For All Validated Techniques

A one-way ANOVA for unweighted decision accuracy showed that differences in unweighted decision accuracy for these 14 PDD techniques were significant [F (13,5119) = 2.753, (p < .001)]. One-way ANOVAs for case status showed that differences in correct decisions were significant for both criterion deceptive cases [F (13,2494) = 1.982, (p = .019)] and criterion truthful cases [F (13,2542) = 2.764, (p < .001)]. Figure 17 shows the mean and confidence intervals for the unweighted accuracy of 14 techniques included in the meta-analysis.
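Scorer reliability in this report is often summarized as the average pairwise Pearson correlation between the numerical totals of different evaluators. A minimal sketch, with illustrative score vectors:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two scorers' numerical totals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative totals from two scorers evaluating the same four exams.
r = pearson_r([-6, -4, 5, 8], [-5, -3, 4, 9])
```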

Figure 17. Mean and confidence intervals for unweighted decision accuracy of 14 PDD techniques.
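The one-sample and leave-one-out t-tests discussed below compare one technique's result against the distribution of the others. A sketch of the underlying statistic; the values below are illustrative, not the committee's actual inputs:

```python
import math

def one_sample_t(values, mu0):
    """One-sample t statistic for a sample mean against a reference value mu0."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    return (mean - mu0) / math.sqrt(var / n)

# Illustrative: accuracies of non-outlier techniques vs a candidate outlier value.
t = one_sample_t([0.86, 0.88, 0.90, 0.92], 0.994)
```

A large-magnitude t indicates that the candidate value lies far outside the range of the remaining techniques.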

A series of ANOVA contrasts showed that differences were significant only for two PDD techniques, the IZCT and the MQTZCT. Exclusion of these two PDD techniques resulted in no significant differences [F (11,4859) = 0.949, (p = .491)] in the unweighted accuracy for the remaining 12 PDD techniques. One-sample t-tests further confirmed the outlier status of the results of these two techniques: t = 212.268 (p < .001) for both the IZCT and the MQTZCT.46 A series of leave-one-out t-tests revealed that none of

46 t-values are the same for both of these techniques because the unweighted mean of weighted sampling means is the same for both techniques (.994).


the other techniques produced outlier results when compared to the results of all other techniques.

APA 2012 criterion validity standards

Table 1 (also shown in the Executive Summary) shows a list of the 14 PDD techniques that satisfied the requirements for inclusion in this meta-analysis at criterion accuracy levels specified in the APA 2012 standard requirements for evidentiary testing, paired-testing, and investigative testing. Also shown in Table 1 are the unweighted decision accuracy and inconclusive rates for each PDD technique. Additional details concerning the sampling distributions and a complete dimensional profile of criterion accuracy for each of these techniques and all included studies can be found in Appendix E.

The combination of all validated PDD techniques, excluding outlier results, produced a decision accuracy of .869 (.036) without inconclusive results. The 95% confidence range was from .798 to .940. The mean inconclusive rate, excluding outlier results, was .128 (.030) with a 95% confidence range of .068 to .187. Aggregated reliability statistics produced a mean Kappa statistic of .642 (.102) with a 95% confidence range of .443 to .842. The mean rate of inter-rater decision agreement, excluding outlier results and excluding inconclusive results, was .901 (.082) with a 95% confidence range from .741 to .999. The mean Pearson correlation coefficient for numerical scores, excluding outlier results, was .876 (.116) with a 95% confidence range of .649 to .999. Table 2 shows the aggregated criterion accuracy profile of all validated CQT PDD techniques, weighted for the sample size and number of scorers.47 Also shown in Table 2 is the criterion accuracy profile including outlier results.

47 A majority of examinations in field polygraph programs are conducted using PDD techniques that are interpreted with an assumption of criterion independence among the RQs. However, a majority of PDD criterion validity research has been conducted using PDD techniques that are interpreted with an assumption of non-independence. Non-independent examination techniques have greater statistical discrimination power and greater accuracy than independent exam techniques. For this reason, the unweighted average was considered to be a more conservative and generalizable estimate of the overall accuracy of all PDD examination techniques. This was calculated as the unweighted average of the weighted aggregation of independent techniques and the weighted aggregation of non-independent techniques. Calculation of the weighted average would result in an overestimation of accuracy. The unweighted average of PPV and NPV might be more optimistic and flattering under some conditions, but can be expected to be less generalizable to field circumstances when base rates are unknown or different from the base rates in the study samples.
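The footnote's calculation, a weighted aggregate within each class followed by an unweighted mean of the two class aggregates, can be sketched as follows; the study accuracies and weights here are hypothetical:

```python
def conservative_overall(independent, non_independent):
    """Unweighted mean of the weighted aggregates of two technique classes.

    Each argument is a list of (accuracy, weight) study tuples. Weighting
    within a class, then averaging the two class aggregates without weights,
    keeps the larger research base on non-independent techniques from
    dominating the overall estimate.
    """
    def wmean(studies):
        total = sum(w for _, w in studies)
        return sum(a * w for a, w in studies) / total

    return (wmean(independent) + wmean(non_independent)) / 2

# Hypothetical study accuracies weighted by (cases x scorers).
overall = conservative_overall([(0.84, 120), (0.86, 80)], [(0.92, 300), (0.90, 100)])
```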


Table 1. Mean (standard deviation) and {95% confidence intervals} for correct decisions (CD) and inconclusive results (INC) for validated PDD techniques.

Evidentiary Techniques / TDA Method:
  Federal You-Phase / ESS1: CD = .904 (.032) {.841 to .966}; INC = .192 (.033) {.127 to .256}
  Event-Specific ZCT / ESS: CD = .921 (.028) {.866 to .977}; INC = .098 (.030) {.039 to .157}
  IZCT / Horizontal2: CD = .994 (.008) {.978 to .999}; INC = .033 (.019) {.001 to .069}
  MQTZCT / Matte3: CD = .994 (.013) {.968 to .999}; INC = .029 (.015) {.001 to .058}
  Utah ZCT DLT / Utah: CD = .902 (.031) {.841 to .962}; INC = .073 (.025) {.023 to .122}
  Utah ZCT PLT / Utah: CD = .931 (.026) {.879 to .983}; INC = .077 (.028) {.022 to .133}
  Utah ZCT Combined / Utah: CD = .930 (.026) {.875 to .984}; INC = .107 (.028) {.048 to .165}
  Utah ZCT CPC-RCMP Series A / Utah: CD = .939 (.038) {.864 to .999}; INC = .185 (.041) {.104 to .266}

Paired Testing Techniques / TDA Method:
  AFMGQT4,8 / ESS5: CD = .875 (.039) {.798 to .953}; INC = .170 (.036) {.100 to .241}
  Backster You-Phase / Backster: CD = .862 (.037) {.787 to .932}; INC = .196 (.040) {.117 to .275}
  Federal You-Phase / 7-position: CD = .883 (.035) {.813 to .952}; INC = .168 (.037) {.096 to .241}
  Federal ZCT / 7-position: CD = .860 (.037) {.801 to .945}; INC = .171 (.040) {.113 to .269}
  Federal ZCT / 7-position evidentiary: CD = .880 (.034) {.813 to .948}; INC = .085 (.029) {.028 to .141}

Investigative Techniques / TDA Method:
  AFMGQT6,8 / 7-position: CD = .817 (.042) {.734 to .900}; INC = .197 (.030) {.138 to .255}
  CIT7 / Lykken Scoring: CD = .823 (.041) {.744 to .903}; INC = NA
  DLST (TES)8 / 7-position: CD = .844 (.039) {.768 to .920}; INC = .088 (.028) {.034 to .142}
  DLST (TES)8 / ESS: CD = .858 (.037) {.786 to .930}; INC = .090 (.026) {.039 to .142}

1 Empirical Scoring System.
2 Generalizability of this outlier result is limited by the fact that no measures of test reliability have been published for this technique. Also, significant differences were found in the sampling distributions of the included studies, suggesting that the sample data are not representative of each other, or that the exams were administered and/or scored differently. One of the studies involved a small sample (N = 12) that was reported in two articles, for which the participating scorer was also the technique developer. One of the publications described the study as a non-blind pilot study. Both reports indicated that one of the six truthful participants was removed from the study after making a false confession. The reported perfect accuracy rate did not include the false confession. Neither the perfect accuracy nor the .167 false-confession rate is likely to generalize to field settings.
3 Generalizability of this outlier result is limited by the fact that the developers and investigators have advised the necessity of intensive training available only from experienced practitioners of the technique, and have suggested that the complexity of the technique exceeds that which other professionals can learn from the published resources. The developer reported a near-perfect correlation coefficient of .99 for the numerical scores, suggesting an unprecedented high rate of inter-scorer agreement, which is unexpected given the purported complexity of the method. Additionally, the data initially provided to the committee for replication studies included only those cases for which the scorers arrived at the correct decision, excluding scores from those cases for which the scorers did not achieve the correct decision. Missing scores were later provided to the committee for both the Mangan et al. (2008) and Shurani and Chavez (2009) studies. However, the resulting sampling means were different from those reported for both replication studies. Because of these discrepancies, the statistical analysis was not re-calculated with the missing scores, and the reported analysis reflects the sampling distribution means as reported. Sampling means for replication studies should not be considered devoid of error or uncontrolled variance.
4 Two versions exist for the AFMGQT, with minor structural differences between them. There is no evidence that the performance of one version is superior to the other. Because replicated evidence would be required to reject a null-hypothesis that the differences are meaningless, and because the selected studies include a mixture of both AFMGQT versions, these results are provided as generalizable to both versions. AFMGQT exams are used in both multi-facet event-specific contexts and multi-issue screening contexts. Both multi-facet and multi-issue examinations were interpreted with decision rules based on an assumption of criterion independence among the RQs.
5 The AFMGQT produced accuracy that is satisfactory for paired testing only when scored with the Empirical Scoring System.
6 There are two techniques for which there are no published studies but which are structurally nearly identical to the AFMGQT: the LEPET and the Utah MGQT. Validity of the AFMGQT can be generalized to these techniques if they are scored with the same TDA methods.
7 Concealed Information Test, also referred to as the Guilty Knowledge Test (GKT) and Peak of Tension test (POT). The data used here were provided in the meta-analysis report of laboratory research by MacLaren (2001).
8 Studies for these PDD techniques were conducted using decision rules based on the assumption of criterion independence among the testing targets. Accuracy of screening techniques may be further improved by the systematic use of a successive-hurdles approach.
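Several techniques in Table 1 are scored with the Empirical Scoring System (footnote 1). Its non-parametric transformation can be illustrated as sign-based scoring: only the direction of the relevant-versus-comparison difference matters, with the EDA channel double-weighted. This is a toy sketch, not the full ESS feature-extraction procedure, and the channel names and values are hypothetical:

```python
def ess_channel_score(relevant, comparison, channel):
    """Toy sign-based score for one relevant/comparison response pair."""
    weight = 2 if channel == "eda" else 1  # EDA is double-weighted in ESS
    if comparison > relevant:
        return weight    # larger comparison response: truthful-leaning
    if relevant > comparison:
        return -weight   # larger relevant response: deceptive-leaning
    return 0             # no perceptible difference

# Because only the sign is used, a response twice as large and one ten
# times as large contribute the same score (robust to non-linearity).
total = ess_channel_score(1.2, 0.4, "eda") + ess_channel_score(3.0, 7.5, "pneumo")
```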


Table 2. Mean (standard deviation) and {95% confidence interval} for criterion accuracy profiles for all validated PDD techniques combined.

(columns: Excluding outlier results | All included studies)

Number of PDD Techniques: 12 | 14
Number of Studies: 39 | 45
N Deceptive: 2,067 | 2,336
N Truthful: 1,802 | 2,031
Total N: 3,869 | 4,367
Number of Scorers: 280 | 295
N of Deceptive Scores: 5,840 | 6,109
N of Truthful Scores: 5,399 | 5,628
Total Scores: 11,239 | 11,737
Percent Correct: .869 (.036) {.798 to .940} | .887 (.033) {.823 to .951}
Inconclusive: .128 (.030) {.068 to .187} | .114 (.028) {.058 to .170}
Sensitivity: .812 (.056) {.702 to .923} | .835 (.051) {.734 to .936}
Specificity: .717 (.061) {.597 to .838} | .751 (.058) {.638 to .864}
FN Errors: .083 (.038) {.008 to .157} | .072 (.035) {.004 to .141}
FP Errors: .144 (.049) {.048 to .239} | .123 (.043) {.039 to .208}
D Inc: .105 (.042) {.022 to .187} | .092 (.037) {.020 to .165}
T Inc: .151 (.042) {.068 to .234} | .136 (.041) {.056 to .216}
PPV: .854 (.049) {.757 to .950} | .874 (.044) {.789 to .960}
NPV: .899 (.047) {.807 to .990} | .911 (.043) {.827 to .995}
D Correct: .909 (.042) {.826 to .992} | .921 (.039) {.844 to .997}
T Correct: .829 (.056) {.721 to .938} | .854 (.049) {.757 to .950}
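PPV and NPV depend on the base rate of deception as well as sensitivity and specificity. A sketch of the standard Bayesian formulation; the inputs below are illustrative, and the table's PPV/NPV values are aggregated from the study samples directly, so they need not match this closed-form calculation:

```python
def ppv_npv(sensitivity, specificity, base_rate):
    """Positive and negative predictive value at a given deception base rate."""
    tp = sensitivity * base_rate
    fn = (1 - sensitivity) * base_rate
    tn = specificity * (1 - base_rate)
    fp = (1 - specificity) * (1 - base_rate)
    return tp / (tp + fp), tn / (tn + fn)

# Illustrative values at a 50% base rate, roughly the balance in the samples.
ppv, npv = ppv_npv(0.85, 0.72, 0.5)
```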


Table 3 shows the criterion accuracy profile of the weighted aggregation of PDD techniques at the evidentiary, paired-testing and investigative levels according to the APA 2012 standards. Also shown in Table 3 is the criterion accuracy profile for evidentiary techniques without the results of the two outlier techniques.

Table 3. Criterion accuracy profiles for evidentiary, paired-testing, and investigative techniques.

(columns: Evidentiary Techniques | Evidentiary w/o Outlier Studies | Paired-Testing Techniques | Investigative Techniques)

Number of Techniques: 5 | 3 | 5 | 4
Number of Studies: 21 | 15 | 12 | 12
N Deceptive: 861 | 592 | 435 | 1,040
N Truthful: 776 | 547 | 408 | 847
Total N: 1,637 | 1,139 | 843 | 1,887
Number of Scorers: 174 | 159 | 56 | 65
N of Deceptive Scores: 3,297 | 3,028 | 1,700 | 1,112
N of Truthful Scores: 3,098 | 2,869 | 1,613 | 917
Total Scores: 6,395 | 5,897 | 3,313 | 2,029
Unweighted Average Accuracy: .910 (.027) {.857 to .963} | .903 (.028) {.847 to .958} | .867 (.036) {.796 to .938} | .844 (.039) {.767 to .920}
Unweighted Average Inconclusives: .090 (.029) {.032 to .147} | .095 (.030) {.035 to .154} | .142 (.036) {.071 to .213} | .114 (.028) {.060 to .168}
Sensitivity: .843 (.050) {.745 to .941} | .832 (.053) {.729 to .935} | .828 (.051) {.728 to .928} | .802 (.047) {.710 to .893}
Specificity: .826 (.054) {.721 to .931} | .816 (.055) {.708 to .923} | .670 (.071) {.531 to .809} | .771 (.073) {.627 to .915}
FN Errors: .082 (.033) {.018 to .147} | .089 (.034) {.021 to .156} | .060 (.032) {.001 to .123} | .158 (.042) {.076 to .240}
FP Errors: .083 (.035) {.014 to .152} | .090 (.037) {.018 to .162} | .159 (.052) {.056 to .261} | .159 (.070) {.022 to .296}
D Inc: .080 (.038) {.005 to .155} | .086 (.041) {.004 to .167} | .112 (.043) {.028 to .195} | .038 (.020) {.001 to .077}
T Inc: .099 (.044) {.014 to .185} | .104 (.044) {.017 to .191} | .170 (.058) {.056 to .284} | .073 (.015) {.043 to .102}
PPV: .915 (.034) {.848 to .982} | .908 (.037) {.836 to .979} | .847 (.050) {.749 to .945} | .860 (.037) {.788 to .933}
NPV: .904 (.042) {.823 to .986} | .898 (.043) {.814 to .982} | .920 (.046) {.829 to .999} | .812 (.082) {.651 to .973}
D Correct: .911 (.037) {.839 to .983} | .904 (.039) {.828 to .980} | .932 (.036) {.862 to .999} | .837 (.045) {.749 to .924}
T Correct: .908 (.038) {.833 to .983} | .901 (.040) {.822 to .980} | .804 (.064) {.678 to .930} | .827 (.072) {.686 to .968}
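The {lower to upper} ranges in Tables 2 and 3 are consistent with a normal-approximation 95% interval of roughly the mean plus or minus 1.96 standard deviations (exact rounding in the printed tables may differ slightly):

```python
def ci95(mean, sd):
    """Normal-approximation 95% confidence interval, rounded to 3 places."""
    half_width = 1.96 * sd
    return (round(mean - half_width, 3), round(mean + half_width, 3))

lo, hi = ci95(0.910, 0.027)  # approx {.857 to .963}, as in Table 3
```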


Figure 18 shows the mean and statistical confidence intervals for correct decisions and inconclusive results for three levels of criterion validity described by the APA 2012 standards. Two-way ANOVAs, including outlier results, showed a significant interaction between criterion status and validation category for correct decisions [F (1,10896) = 7433.144, (p < .001)] and for inconclusive results [F (1,10896) = 3562.384, (p < .001)], suggesting that techniques at these different categorical levels may handle deceptive and truthful cases with different effectiveness. However, post-hoc one-way ANOVAs showed there was no significant difference in the proportion of correct decisions, inconclusive results, or errors for deceptive or truthful cases.

Figure 18. Mean and confidence interval plot for APA validation categories.

Evidentiary testing techniques

Five PDD techniques were reported to produce both sufficiently high levels of diagnostic accuracy and low inconclusive rates that satisfy the APA 2012 standard requirements for evidentiary testing. Scores from 21 surveys and experiments were summarized to describe the criterion validity of these evidentiary techniques. Studies that support these evidentiary techniques included 174 scorers who provided 7,407 numerical scores for 1,637 confirmed exams, including 3,821 numerical scores for 861 confirmed deceptive examinations and 3,586 numerical scores for 776 confirmed truthful examinations.

Table 4 shows the criterion accuracy profiles for the five PDD techniques that satisfy the APA 2012 requirements for evidentiary/diagnostic testing. Also shown in Table 4 are the number of included studies for each PDD technique, the total number of scored results, and reliability, along with the means and standard deviations of the deceptive and truthful scores of the included studies. Mean test sensitivity, test specificity, and unweighted accuracy have been reported at levels that are statistically significantly greater than chance (50%) for each of these five PDD techniques.


Table 4. Criterion accuracy profiles for evidentiary/diagnostic PDD techniques.

(columns: Federal You-Phase | IZCT* | MQTZCT* | Utah PLT (combined) | ZCT ESS)

TDA Method: ESS | Horizontal | Matte | Utah | ESS
Number of Studies: 2 | 3 | 3 | 7 | 6
N Deceptive: 61 | 86 | 183 | 147 | 384
N Truthful: 61 | 93 | 139 | 138 | 348
Total N: 122 | 179 | 319 | 285 | 732
Number of Scorers: 11 | 8 | 7 | 8 | 140
N of Deceptive Scores: 160 | 86 | 183 | 197 | 2,671
N of Truthful Scores: 160 | 93 | 136 | 188 | 2,521
Total Scores: 320 | 179 | 319 | 385 | 5,192
Mean D: -7.512 | -21.505 | -8.711 | -10.885 | -10.46
StDev D: 6.184 | 12.606 | 2.489 | 7.878 | 8.949
Mean T: 6.146 | 19.626 | 5.226 | 9.372 | 8.219
StDev T: 6.217 | 4.232 | 3.479 | 8.066 | 8.051
Reliability - Kappa: - | -† | - | .650 | .590
Reliability - Agreement: .900 | - | - | .960 | .950
Reliability - Correlation: - | - | .990‡ | .910 | -
Unweighted Average Accuracy: .904 (.032) {.841 to .966} | .994 (.008) {.978 to .999} | .994 (.013) {.968 to .999} | .930 (.028) {.875 to .984} | .921 (.028) {.866 to .977}
Unweighted Average Inconclusives: .192 (.033) {.127 to .256} | .033 (.019) {.001 to .069} | .029 (.015) {.001 to .058} | .107 (.030) {.048 to .165} | .098 (.030) {.039 to .157}
Sensitivity: .845 (.052) {.742 to .948} | .977 (.020) {.937 to .999} | .967 (.021) {.926 to .999} | .853 (.049) {.757 to .948} | .817 (.056) {.706 to .927}
Specificity: .757 (.064) {.633 to .882} | .946 (.035) {.878 to .999} | .963 (.033) {.899 to .999} | .809 (.056) {.699 to .918} | .846 (.051) {.747 to .946}
FN Errors: .034 (.026) {.001 to .085} | .012 (.015) {.001 to .041} | .011 (.021) {.001 to .052} | .051 (.031) {.001 to .112} | .077 (.037) {.004 to .151}
FP Errors: .138 (.050) {.039 to .236} | .001 (.005) {.001 to .010} | .001 (.015) {.001 to .030} | .074 (.038) {.001 to .148} | .064 (.034) {.001 to .130}
D INC: .128 (.046) {.037 to .219} | .012 (.014) {.001 to .040} | .022 (.001) {.022 to .022} | .096 (.040) {.017 to .176} | .106 (.044) {.020 to .192}
T INC: .255 (.044) {.170 to .341} | .054 (.035) {.001 to .122} | .037 (.029) {.001 to .094} | .117 (.046) {.027 to .207} | .089 (.042) {.008 to .171}
PPV: .860 (.050) {.761 to .958} | .999 (.004) {.991 to .999} | .999 (.015) {.970 to .999} | .923 (.039) {.847 to .999} | .931 (.038) {.857 to .999}
NPV: .957 (.033) {.892 to .999} | .989 (.019) {.952 to .999} | .985 (.021) {.944 to .999} | .938 (.036) {.867 to .999} | .912 (.042) {.830 to .993}
D Correct: .961 (.029) {.903 to .999} | .988 (.015) {.959 to .999} | .989 (.021) {.948 to .999} | .944 (.034) {.877 to .999} | .913 (.042) {.831 to .996}
T Correct: .846 (.056) {.736 to .956} | .999 (.005) {.989 to .999} | .999 (.015) {.969 to .999} | .916 (.043) {.832 to .999} | .929 (.037) {.857 to .999}

* Outlier results that differ significantly from the normal range of the other techniques.
† No reliability data have been published for any of the studies on the IZCT.
‡ A correlation coefficient of .990 is an extraordinary and remarkable finding in any field of research, and suggests an extremely low rate of disagreement between the numerical scores of blind evaluators using the MQTZCT. This statistic cannot be found in the Matte and Reuss (1989) dissertation paper for the now defunct Columbia Pacific University, but was published in the included Matte and Reuss (1989) reprint in Polygraph. Despite this extremely high correlation of numerical scores from different scorers, developers and researchers of the MQTZCT have expressed repeated cautions regarding the lack of generalizability of MQTZCT results without intensive proprietary training.
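The Kappa rows in Table 4 report decision agreement adjusted for chance agreement. A minimal sketch of Cohen's kappa for two scorers, using hypothetical paired decisions:

```python
def cohens_kappa(pairs):
    """Cohen's kappa for paired categorical decisions from two scorers."""
    n = len(pairs)
    observed = sum(a == b for a, b in pairs) / n  # raw agreement
    categories = {c for pair in pairs for c in pair}
    # Expected chance agreement from each scorer's marginal proportions.
    expected = sum(
        (sum(a == c for a, _ in pairs) / n) * (sum(b == c for _, b in pairs) / n)
        for c in categories
    )
    return (observed - expected) / (1 - expected)

# Hypothetical decisions: D = deceptive, T = truthful.
k = cohens_kappa([("D", "D"), ("T", "T"), ("D", "T"), ("T", "T")])  # 0.5
```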


A two-by-five way ANOVA (criterion status x technique) for correct decisions showed a significant interaction between technique and case status [F (1,7397) = 8944.964, (p < .001)], indicating that these five different evidentiary techniques handled deceptive and truthful cases differently.

Post-hoc one-way ANOVAs showed that differences in the rate of correct decisions for criterion truthful cases were significant [F (4,828) = 3.118 (p = .015)], while differences in the rate of correct decisions for deceptive cases were not significant. A series of ANOVA contrasts showed that differences were significant only for the two outlier techniques. There were no significant differences when the outliers were not included. Figure 19 shows a mean and confidence interval plot for correct decisions and inconclusive results of the five evidentiary techniques.

Figure 19. Means and confidence intervals for evidentiary techniques.

Paired-testing techniques

Five techniques were identified as providing a sufficient level of accuracy to satisfy the APA requirement for paired-testing.48 Scores from 12 surveys and experiments were summarized to describe the criterion validity of these paired-testing techniques. Studies that support these paired-testing techniques included 56 scorers who provided 3,313 numerical scores for 843 confirmed exams, including 1,700 numerical scores for 435 confirmed deceptive examinations and 1,613 numerical scores for 408 confirmed truthful examinations.

Table 5 shows the criterion accuracy profiles for the five PDD techniques that satisfy the APA 2012 requirements for paired testing. Also shown in Table 5 are the number of included studies for each PDD technique, the total number of scored results, and reliability, along with the means and standard deviations of the deceptive and truthful scores of the included studies.

48 All PDD techniques that meet the APA 2012 standard requirement for evidentiary testing also meet the requirements for paired-testing and investigative testing. Those PDD techniques that meet the criterion accuracy requirement for paired-testing are also sufficiently valid for investigative testing.


Although test sensitivity and unweighted decision accuracy are significantly greater than chance for all five of these paired-testing techniques, three of these techniques produced test specificity levels that were not significantly greater than chance (Backster You-Phase/Backster, Federal You-Phase/7-position, and Federal ZCT/7-position).

Table 5. Criterion accuracy profiles for paired-testing techniques.


Technique                         Backster          Federal           Federal ZCT       Federal ZCT       AFMGQT
                                  You-Phase         You-Phase
TDA Method                        Backster          7-position        7-position        7-position        ESS
                                                                                        evidentiary
Number of Studies                 2                 2                 3                 2                 3
N Deceptive                       61                61                139               80                94
N Truthful                        61                61                109               80                97
Total N                           122               122               248               160               191
Number Scorers                    8                 11                16                16                5
N of Deceptive Scores             127               160               767               530               116
N of Truthful Scores              127               160               677               530               119
Total Scores                      254               320               1,444             1,060             235
Mean D                            -16.055           -7.195            -8.577            -8.263            -2.960
StDev D                           7.417             5.824             9.018             9.032             4.765
Mean T                            5.216             5.999             7.466             7.852             3.738
StDev T                           10.291            5.893             8.472             9.721             4.104
Reliability - Kappa               -                 -                 .570              -                 -
Reliability - Agreement           -                 .850              .800              .870              1
Reliability - Correlation         .567              -                 -                 -                 .930
Unweighted Average Accuracy       .862 (.037)       .883 (.035)       .860 (.037)       .880 (.034)       .875 (.039)
                                  {.787 to .932}    {.813 to .952}    {.788 to .931}    {.813 to .948}    {.798 to .953}
Unweighted Average Inconclusives  .196 (.040)       .168 (.037)       .171 (.040)       .085 (.029)       .170 (.036)
                                  {.117 to .275}    {.096 to .241}    {.093 to .249}    {.028 to .141}    {.100 to .241}
Sensitivity                       .836 (.052)       .841 (.050)       .858 (.051)       .804 (.054)       .729 (.065)
                                  {.734 to .938}    {.742 to .939}    {.759 to .957}    {.697 to .911}    {.603 to .856}
Specificity                       .556 (.070)       .632 (.069)       .581 (.073)       .809 (.057)       .700 (.063)
                                  {.418 to .694}    {.497 to .768}    {.438 to .723}    {.698 to .920}    {.577 to .823}
FN Errors                         .007 (.012)       .028 (.023)       .033 (.029)       .110 (.044)       .092 (.046)
                                  {-.016 to .030}   {.001 to .073}    {.001 to .090}    {.024 to .197}    {.002 to .182}
FP Errors                         .207 (.058)       .161 (.051)       .188 (.051)       .109 (.044)       .112 (.047)
                                  {.091 to .322}    {.061 to .261}    {.089 to .287}    {.022 to .196}    {.020 to .204}
D INC                             .156 (.051)       .131 (.046)       .110 (.044)       .087 (.039)       .178 (.056)
                                  {.055 to .257}    {.041 to .221}    {.023 to .196}    {.010 to .163}    {.068 to .289}
T INC                             .236 (.059)       .205 (.057)       .232 (.064)       .083 (.040)       .162 (.047)
                                  {.119 to .354}    {.093 to .318}    {.106 to .358}    {.003 to .162}    {.071 to .254}
PPV                               .801 (.055)       .840 (.053)       .838 (.053)       .880 (.048)       .864 (.058)
                                  {.693 to .909}    {.736 to .943}    {.734 to .943}    {.786 to .974}    {.751 to .977}
NPV                               .987 (.021)       .958 (.035)       .940 (.046)       .880 (.048)       .887 (.052)
                                  {.945 to .999}    {.889 to .999}    {.851 to .999}    {.786 to .974}    {.785 to .989}
D Correct                         .991 (.014)       .968 (.027)       .963 (.033)       .879 (.048)       .888 (.057)
                                  {.963 to .999}    {.916 to .999}    {.898 to .999}    {.786 to .973}    {.777 to .999}
T Correct                         .728 (.073)       .797 (.064)       .756 (.067)       .881 (.048)       .862 (.053)
                                  {.584 to .873}    {.672 to .923}    {.625 to .887}    {.786 to .976}    {.758 to .967}
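The rows of Table 5 can be related to raw outcome counts. The sketch below uses hypothetical counts; the committee's actual estimates were aggregated across studies and scorers, with standard errors that this simplified example does not reproduce. Note that sensitivity, FN rate, and D INC rate sum to 1 within the deceptive cases, as they do (within rounding) in the table.

```python
# Sketch of how the criterion-accuracy rows in Table 5 relate to raw counts.
# Counts below are hypothetical, not taken from any included study.

def accuracy_profile(tp, fn, d_inc, tn, fp, t_inc):
    """tp/fn/d_inc: deceptive cases called DI / NDI / inconclusive;
    tn/fp/t_inc: truthful cases called NDI / DI / inconclusive."""
    n_d = tp + fn + d_inc          # total criterion-deceptive cases
    n_t = tn + fp + t_inc          # total criterion-truthful cases
    d_correct = tp / (tp + fn)     # correct among deceptive, excluding inconclusives
    t_correct = tn / (tn + fp)     # correct among truthful, excluding inconclusives
    return {
        "sensitivity": tp / n_d,   # deceptive cases correctly called DI
        "specificity": tn / n_t,   # truthful cases correctly called NDI
        "FN": fn / n_d,
        "FP": fp / n_t,
        "D INC": d_inc / n_d,
        "T INC": t_inc / n_t,
        "PPV": tp / (tp + fp),     # proportion of DI calls that are correct
        "NPV": tn / (tn + fn),     # proportion of NDI calls that are correct
        "unweighted accuracy": (d_correct + t_correct) / 2,
        "unweighted inconclusives": (d_inc / n_d + t_inc / n_t) / 2,
    }

profile = accuracy_profile(tp=84, fn=3, d_inc=13, tn=56, fp=21, t_inc=23)
```

The "Unweighted Average Inconclusives" row in the table is exactly the mean of the D INC and T INC rows, which this function mirrors.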



Ad Hoc Committee on Validated Techniques

Figure 20 shows a mean and confidence interval plot for correct decisions and inconclusive results of the five paired-testing techniques. A two-by-five way ANOVA, criterion status x technique, for correct decisions showed a significant interaction [F (1,3303) = 5891.333, p < .001], indicating that these five paired-testing techniques produced different rates of correct decisions for deceptive and truthful cases. Post-hoc one-way ANOVAs showed no significant differences in the rate of correct decisions for criterion deceptive cases or criterion truthful cases.
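The post-hoc comparisons described here are one-way ANOVAs on results grouped by technique. A minimal sketch of the F computation, using hypothetical 0/1 correct-decision indicators rather than the committee's actual scored data:

```python
# One-way ANOVA F statistic, computed from scratch for illustration.
# Each inner list is a hypothetical group of 0/1 correct-decision indicators
# for one technique; the committee's analysis used the actual study results.

def one_way_anova_f(groups):
    """Return the F statistic and degrees of freedom for a one-way ANOVA."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their group mean
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

groups = [[1, 1, 1, 0, 1, 1], [1, 0, 1, 1, 1, 0], [1, 1, 0, 1, 1, 1]]
f_stat, df1, df2 = one_way_anova_f(groups)
```

A small F (relative to the critical value for the given degrees of freedom) corresponds to the "no significant differences" conclusions reported above.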

Figure 20. Means and confidence intervals for paired-testing techniques.

A two-by-five way ANOVA, criterion status x technique, for inconclusive decisions showed a significant interaction [F (1,3303) = 5891.333, p < .001], indicating that these five paired-testing techniques produced different rates of inconclusive results for deceptive and truthful cases. Post-hoc one-way ANOVAs showed that differences in inconclusive rates were not significant for criterion deceptive or criterion truthful cases.

Investigative testing techniques

Four PDD techniques produced criterion accuracy that satisfies the APA requirement for investigative testing. Scores from 12 surveys and experiments were summarized to describe the criterion validity of these investigative techniques.49 Studies that support these investigative techniques included 65 scorers who provided 2,029 numerical scores for 1,887 confirmed exams, including 1,112 numerical scores for 1,040 confirmed deceptive examinations and 917 numerical scores for 847 confirmed truthful examinations.

Table 6 shows the criterion accuracy profiles for the four PDD techniques that satisfy the APA 2012 requirements for investigative testing. Also shown in Table 6 are the number of included studies for each PDD technique, the total number of scored results, and reliability, along with the mean and standard deviations of the average deceptive and truthful scores of the included studies. Unweighted decision accuracy and test sensitivity have been reported as significantly greater than chance for all four of these investigative techniques. Three of these investigative techniques, the CIT, and

49 One of the included studies was a meta-analysis that summarized the results of laboratory studies using the CIT.


DLST/TES scored with both the seven-position and ESS models, produced test specificity that was significantly greater than chance. Test specificity for one investigative technique, the AFMGQT scored with the seven-position model, was not significantly greater than chance.

Table 6. Criterion accuracy for investigative techniques.


Technique                         CIT/GKT           DLST/TES          DLST/TES          AFMGQT
TDA Method                        Lykken            7-position        ESS               7-position
Number of Studies                 39                4                 4                 3
N Deceptive                       666               131               149               94
N Truthful                        404               197               149               97
Total N                           1070              328               298               191
Number Scorers                    39                16                5                 5
N of Deceptive Scores             666               156               174               116
N of Truthful Scores              404               221               173               119
Total Scores                      1070              377               347               235
Mean D                            -                 -2.126            -2.131            -2.607
StDev D                           -                 3.959             3.801             4.754
Mean T                            -                 3.162             3.412             3.114
StDev T                           -                 3.531             3.153             3.705
Reliability - Kappa               -                 .760              -                 .750
Reliability - Agreement           -                 .806              .840              .965
Reliability - Correlation         -                 -                 -                 .940
Unweighted Average Accuracy       .823 (.041)       .844 (.039)       .858 (.037)       .817 (.042)
                                  {.744 to .903}    {.768 to .920}    {.786 to .930}    {.734 to .900}
Unweighted Average Inconclusives  .001 (.001)       .088 (.028)       .090 (.026)       .197 (.030)
                                  {.001 to .001}    {.034 to .142}    {.039 to .142}    {.138 to .255}
Sensitivity                       .815 (.048)       .748 (.062)       .809 (.069)       .783 (.058)
                                  {.721 to .910}    {.626 to .869}    {.674 to .945}    {.669 to .896}
Specificity                       .832 (.067)       .792 (.060)       .751 (.031)       .538 (.068)
                                  {.700 to .963}    {.674 to .909}    {.691 to .811}    {.405 to .672}
FN Errors                         .185 (.048)       .156 (.050)       .112 (.057)       .079 (.050)
                                  {.090 to .279}    {.058 to .255}    {.001 to .224}    {.001 to .177}
FP Errors                         .168 (.067)       .127 (.052)       .146 (.027)       .203 (.057)
                                  {.037 to .300}    {.026 to .229}    {.093 to .200}    {.090 to .315}
D INC                             .001 (.001)       .096 (.041)       .078 (.052)       .137 (.033)
                                  {.001 to .001}    {.016 to .175}    {.001 to .180}    {.071 to .202}
T INC                             .001 (.001)       .081 (.037)       .102 (.014)       .257 (.049)
                                  {.001 to .001}    {.008 to .153}    {.075 to .130}    {.160 to .354}
PPV                               .889 (.037)       .806 (.055)       .848 (.041)       .790 (.059)
                                  {.816 to .961}    {.698 to .914}    {.767 to .928}    {.675 to .905}
NPV                               .732 (.076)       .878 (.054)       .870 (.052)       .874 (.062)
                                  {.583 to .881}    {.772 to .983}    {.768 to .971}    {.753 to .996}
D Correct                         .815 (.048)       .827 (.055)       .878 (.067)       .908 (.053)
                                  {.721 to .910}    {.719 to .935}    {.746 to .999}    {.804 to .999}
T Correct                         .832 (.067)       .861 (.055)       .837 (.027)       .726 (.066)
                                  {.700 to .963}    {.753 to .969}    {.783 to .891}    {.597 to .856}
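Each cell in Tables 5 and 6 reports an estimate, its standard error, and a 95% confidence interval. The published intervals are consistent with a plain normal approximation (estimate ± 1.96 × SE); the following is a sketch of that arithmetic, not necessarily the committee's exact interval construction:

```python
# Normal-approximation 95% confidence interval for a reported proportion,
# using the published CIT/GKT sensitivity from Table 6 (.815 with SE .048).

def normal_ci(p, se, z=1.96):
    """Two-sided confidence interval: p +/- z * se."""
    return p - z * se, p + z * se

lo, hi = normal_ci(0.815, 0.048)
# .815 +/- 1.96 x .048 gives an interval close to the published {.721 to .910};
# small differences reflect rounding in the published values.
```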


Figure 21 shows a mean and confidence interval plot for correct decisions and inconclusive results of the four investigative techniques. A two-by-four way ANOVA, criterion status x technique, for correct decisions showed a significant interaction [F (1,2021) = 1320.745, p < .001], indicating that these four investigative techniques differed in their abilities to correctly classify deceptive and truthful cases. However, post-hoc one-way ANOVAs showed no significant differences in the rate of correct decisions for criterion deceptive or criterion truthful cases.

Figure 21. Means and confidence intervals for investigative techniques.

A two-by-three way ANOVA, criterion status x technique, for inconclusive decisions showed a significant interaction [F (1,953) = 404.177, p < .001], indicating that these three investigative techniques produced different rates of inconclusive results for deceptive and truthful cases. Post-hoc one-way ANOVAs showed that differences in inconclusive rates were not significant for deceptive cases, but were significant for truthful cases [F (2,478) = 3.418, p = .034]. CIT/GKT results do not include an inconclusive category, and this technique was not included in the two-way analysis for inconclusive results.

Independent and non-independent PDD techniques

Table 7 shows the criterion accuracy profile of four PDD techniques that are interpreted with decision rules based on an assumption of independent criterion variance among the RQs, along with the criterion accuracy profile of PDD techniques that are interpreted with decision rules based on an assumption of non-independence. Scores were summarized from 14 surveys and experiments involving PDD techniques that are interpreted with an assumption of independent criterion variance among the RQs. These studies included 31 scorers who provided 1,194 numerical scores for 1,008 confirmed exams, including 562 numerical scores for 468 confirmed deceptive examinations and 632 numerical scores for 540 confirmed truthful examinations. Excluding outlier results, scores from 24 surveys and experiments were summarized to describe the criterion validity of PDD techniques for which the results are interpreted with decision rules based on an assumption of non-independence of the criterion variance of the RQs. These studies included 210 scorers who provided 8,975 numerical scores for 1,791 confirmed exams, including 4,612 numerical scores for 933 confirmed deceptive examinations and 4,363 numerical scores for 858 confirmed truthful examinations.


Excluding outlier results, comparison question techniques intended for event-specific (single issue) diagnostic testing, in which the criterion variance of multiple relevant questions is assumed to be non-independent, produced an aggregated decision accuracy rate of .890 (.829 - .951), with a combined inconclusive rate of .110 (.047 - .173). Comparison question PDD techniques designed to be interpreted with the assumption of independence of the criterion variance of multiple relevant questions produced an aggregated decision accuracy rate of .850 (.773 - .926) with a combined inconclusive rate of .125 (.068 - .183). The unweighted average of accuracy for independent and non-independent PDD techniques, excluding outlier results, produced a decision accuracy level of .869 (.798 - .940) with an inconclusive rate of .128 (.068 - .187), as shown in Table 7.

Table 7. Criterion accuracy profile of independent and non-independent PDD techniques.


Criterion                 Independent         Non-independent     Non-independent
                          PDD Techniques      PDD Techniques      Techniques with
                                                                  Outlier Results
Number of Techniques      4                   7                   9
Number of Studies         14                  24                  30
N Deceptive               468                 933                 1,202
N Truthful                540                 858                 1,087
Total N                   1,008               1,791               2,289
Number Scorers            31                  210                 225
N of Deceptive Scores     562                 4,612               4,881
N of Truthful Scores      632                 4,363               4,592
Total Scores              1,194               8,975               9,473
Percent Correct           .850 (.039)         .890 (.031)         .896 (.030)
                          {.773 to .926}      {.829 to .951}      {.837 to .955}
Inconclusive              .125 (.029)         .110 (.032)         .106 (.031)
                          {.068 to .183}      {.047 to .173}      {.044 to .167}
Sensitivity               .771 (.072)         .833 (.052)         .840 (.050)
                          {.630 to .911}      {.731 to .934}      {.743 to .938}
Specificity               .719 (.047)         .765 (.061)         .775 (.059)
                          {.626 to .811}      {.646 to .884}      {.658 to .891}
FN Errors                 .113 (.058)         .078 (.033)         .074 (.032)
                          {.001 to .226}      {.013 to .143}      {.011 to .138}
FP Errors                 .144 (.039)         .115 (.042)         .109 (.041)
                          {.066 to .221}      {.032 to .197}      {.029 to .189}
D Inc                     .112 (.051)         .093 (.041)         .089 (.039)
                          {.013 to .212}      {.012 to .174}      {.011 to .166}
T Inc                     .136 (.031)         .127 (.049)         .122 (.049)
                          {.076 to .196}      {.030 to .223}      {.027 to .218}
PPV                       .828 (.059)         .886 (.041)         .893 (.039)
                          {.712 to .943}      {.806 to .967}      {.816 to .969}
NPV                       .878 (.049)         .906 (.044)         .910 (.043)
                          {.782 to .973}      {.820 to .993}      {.826 to .995}
D Correct                 .873 (.066)         .915 (.037)         .919 (.036)
                          {.744 to .999}      {.842 to .988}      {.849 to .989}
T Correct                 .831 (.043)         .866 (.049)         .873 (.047)
                          {.746 to .915}      {.770 to .962}      {.780 to .965}
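The "Percent Correct" rows above are unweighted averages: the mean of the correct-decision rates for deceptive and truthful cases, rather than the simple proportion of correct decisions. A small sketch with hypothetical counts shows why the two can differ when the case mix is unbalanced, which is the base-rate robustness point made in footnote 52:

```python
# Unweighted vs weighted (simple proportion) decision accuracy.
# Counts are hypothetical, chosen to make the base-rate effect visible.

def unweighted_accuracy(correct_d, total_d, correct_t, total_t):
    """Mean of the per-criterion correct-decision rates."""
    return (correct_d / total_d + correct_t / total_t) / 2

def weighted_accuracy(correct_d, total_d, correct_t, total_t):
    """Simple proportion of correct decisions, pooled across criterion states."""
    return (correct_d + correct_t) / (total_d + total_t)

# 90 of 100 deceptive cases correct, but only 21 of 30 truthful cases correct:
uw = unweighted_accuracy(90, 100, 21, 30)   # (0.90 + 0.70) / 2 = 0.80
w = weighted_accuracy(90, 100, 21, 30)      # 111 / 130, about 0.854
```

Because deceptive cases dominate this hypothetical sample, the pooled proportion is inflated toward the stronger deceptive-case performance, while the unweighted average treats both criterion states equally.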


Figure 22 shows the mean and statistical confidence intervals for correct decisions and inconclusive rates for PDD techniques interpreted with decision rules based on assumptions of independent and non-independent criterion variance, excluding outlier results. Two-way ANOVAs, criterion state x independence, excluding outlier results, showed a significant interaction for correct decisions [F (1,10165) = 2656.637, p < .001] and inconclusive results [F (1,10165) = 806.839, p < .001]. However, post-hoc ANOVAs showed there was no significant one-way difference in the proportion of correct decisions or inconclusive results for criterion deceptive cases or criterion truthful cases.

Figure 22. Mean and confidence interval plot for criterion independent and non-
independent (excluding outlier results) PDD techniques.

Discussion

Fourteen PDD techniques (shown in Figure 17) meet the requirements of the APA 2012 standards for test validation. These techniques are supported by published descriptions of the protocol for test administration and test data analysis, using instrumentation representative of that used in field practice, and by published and replicated empirical support for the criterion accuracy of a published method for test data analysis.

Five PDD techniques have published evidence of validity that meets the APA 2012 requirements for evidentiary testing, including unweighted decision accuracy over .900 along with inconclusive rates under .200. These five PDD techniques are, in alphabetical order: the Federal You-Phase technique scored with the Empirical Scoring System (ESS), the IZCT, the MQTZCT, the Utah ZCT (including PLT, DLT, and CPC-RCMP variants) scored with the Utah numerical scoring system, and any variant of an event-specific three-question ZCT scored with the ESS. Statistical analysis revealed two statistical outliers, the IZCT and the MQTZCT. Two-way ANOVAs indicated that there were no significant differences between the other evidentiary techniques when the outlier results were not included.

Five other PDD techniques were found to produce criterion accuracy that meets the APA 2012 standard requirements for paired-testing, with unweighted decision accuracy over .860 along with inconclusive rates under .200. These PDD techniques are, in alphabetical order, the AFMGQT when scored with the ESS, the Backster You-Phase technique scored with the Backster numerical scoring system, the Federal You-Phase


technique scored with the Federal seven-position TDA model, the Federal ZCT scored with the Federal seven-position TDA model, and the Federal ZCT scored with the seven-position TDA model and interpreted with evidentiary decision rules. Although this level of validation is intended to serve the needs for criterion accuracy in paired-testing situations, the majority of PDD examinations are not intended for paired testing or evidentiary use in courtroom settings. It is therefore inevitable that many field examinations may be conducted with the PDD techniques in this list though not intended for use in courtroom settings. Although a significant interaction was observed between criterion status and PDD techniques, indicating that different PDD techniques may provide subtle differences in accuracy with criterion deceptive and criterion truthful cases, none of the one-way main effects was significant for decision accuracy or inconclusive results among the deceptive or truthful cases. The present evidence does not support a conclusion that any of these techniques provides accuracy that is different from the other techniques, and instead suggests this group of PDD techniques provides overall criterion accuracy of similar effectiveness.

Four additional PDD techniques were found to satisfy the APA 2012 standard requirements for investigative testing. These four PDD techniques are, in alphabetical order, the CIT/GKT, the DLST/TES scored with the seven-position TDA method, the DLST/TES scored with the ESS, and the AFMGQT50 when scored with the seven-position TDA method. Although there may be subtle differences in the accuracy of these techniques with criterion deceptive and criterion truthful cases, there were no significant main effect differences for decision accuracy or inconclusive results among the deceptive or truthful cases. These results suggest that this group of PDD techniques provides overall criterion accuracy of similar effectiveness, and the present evidence does not support a conclusion that any of these techniques has accuracy different from the other techniques.

Outlier results

Two outliers were identified: the IZCT and the MQTZCT. Research for both of these techniques reported near-perfect accuracy, and these results were found to be statistical outliers to the distribution of results predicted by all other studies on all other techniques, including the other evidentiary techniques in which these two studies are grouped. These two techniques rely on support from the most problematic research of all studies included in the meta-analysis.

One of the two studies included in support of the IZCT (Gordon et al., 2005; also described by Mohamed et al., 2006) is a very small study described in one publication as a non-blind pilot study. The use of pilot studies to answer questions about criterion accuracy is troublesome. Additionally, both reports indicated that one of the 12 participants in the Gordon et al. (2005) study, a programmed innocent participant, made a false confession to the examiner, also the primary author, during the pre-test interview. That participant was removed from the experiment, which illuminates the non-blind study design. A false confession in field PDD programs would not be immediately distinguishable from an authentic confession. In field polygraph programs a pre-test confession would be viewed as a practical and successful form of resolution of the matter under investigation. Authentic confessions are regarded as PDD successes, and it is therefore necessary to regard false confessions as problems. In a field situation, it would only be later, when additional evidence is available, that the confession would be identified as an error and would be viewed as problematic.

Inclusion of this error into the study results would have resulted in a false-positive (i.e., false-confession) rate of .167 and less than perfect test accuracy. Instead, the results from the Gordon et al. (2005) study were provided without the false confession,

50 Because the Utah MGQT and the LEPET are structurally virtually identical to the AFMGQT, and use the same scoring regimen, it is reasonable to generalize the AFMGQT validation findings to these two techniques.


along with a reported decision accuracy rate of 1.000. It is possible that neither the reported decision accuracy rate of 1.000 nor the false confession rate of .167 is representative of IZCT performance in field settings. An argument could be offered that since this was a non-blind pilot study, which was not designed to serve as a criterion accuracy study, removing errors from the reported study result was justified. Pilot studies like this help guide decisions about the funding and design of more rigorous research into areas such as fMRI or other methods for lie detection. However, the selective exclusion of unfavorable data from a study of criterion accuracy requires strong justification.

An additional concern regarding the evidence supporting the IZCT is the fact that the sampling distributions from the three included studies differ significantly. Significant differences are the result of several possible conditions, including: 1) the samples were selected from different populations, 2) the IZCT was administered differently to the different study samples, or 3) the study samples were scored and interpreted with a different application of the TDA rules. It is also possible that the observed significant differences are the result of a highly selective sampling methodology, in which examinations are included based on the examiner's or investigator's judgment of good or confident results, such as would occur in the context of a direct admission regarding the investigation targets. Over-reliance on confession confirmation could have the effect of systematically excluding both false-negative and false-positive error cases, for which no confession would be likely to be obtained.

Regardless of the reason, deceptive and truthful scores were expressed in significantly different ways in the three different studies on the IZCT. The meaning of these significant differences to this meta-analysis is that the included studies appear to be based on samples that are not representative of each other, and it is unknown whether one or more of the studies is not representative of the population of examinees.

A third problematic concern with the IZCT is that none of the published studies included any reliability statistics, and calculations of interrater reliability could not be completed from the available data. The absence of reliability statistics does not allow estimates of the generalizability of the study results to results that would be obtained from other examiners or other scorers. Coupled with the significant interaction effects between sampling distribution and case status, the present evidence is insufficient to support the notion that other practitioners would obtain scores or results similar to those reported in the published studies.

Studies supporting the generalizability of the MQTZCT, the other statistical outlier, are limited by some interesting and unique factors. First, the developer of the MQTZCT and previous authors themselves seem to have cautioned against the generalizability of this technique by emphasizing the need for intensive and specialized training available only from practitioners of the method. Indeed, they have asserted that the complexity of the technique, and its related psychological hypotheses, are such that other trained PDD examiners should not reasonably expect to learn or properly execute the MQTZCT based on information available in the published sources. An emphasis on strict compliance with a complex and precise system of many rules gives the impression that the technique should be regarded as fragile, non-robust, and easily disrupted by even slight departures from stipulated procedures.

A second, equally important concern involves the fact that a significant interaction was found between sampling distribution and case status. Although one-way differences were not significant within the deceptive or truthful groups, the significant interaction effect indicates that the scores of criterion deceptive and criterion truthful cases are expressed or interpreted in different ways within the sampling distributions of the three included studies on the MQTZCT. In other words, the data are not congruent even among the studies used to support the MQTZCT. This significant interaction suggests the possibility that the included studies are based on samples that are not representative of each other. It is unknown whether one or more of the studies is not representative of the population of all examinees, reducing our confidence in the potential for generalizability of the reported results.


A third concern involving the MQTZCT is that the reported reliability coefficient of .990 was published in the Matte and Reuss (1989) reprint of the dissertation published in the journal Polygraph, but cannot be located in the original dissertation study from the no-longer-extant Columbia Pacific University. This is both unfortunate and concerning because the unprecedentedly high rate of inter-scorer agreement is unexpected given the purported complexity of the method.

A final confound to the generalizability of the results of the included studies on the MQTZCT is that the data provided to the committee initially included numerical scores for only those cases for which the scorers achieved the correct result. Data available to the ad-hoc committee did not initially include numerical scores for those cases for which the scorers achieved erroneous or inconclusive results. Missing scores were later provided to the committee for both the Mangan, Armitage and Adams (2008) and Shurani, Stein and Brand (2009) studies. However, the resulting sampling distribution means, calculated with the missing scores, were different from those reported for both studies. Because of these discrepancies, the statistical analysis was not re-calculated with the missing scores, and this analysis reflects the mean scores as reported by Mangan, Armitage and Adams (2008), and by Shurani and Chavez (2009). Field data are always a combination of diagnostic (i.e., controlled or explained) variance and error variance (i.e., uncontrolled or unexplained variance). The sampling means reported in the Mangan, Armitage and Adams (2008) and Shurani, Stein and Brand (2009) studies are systematically devoid of error variance. Given that a significant interaction effect was observed between sampling distribution and case status, the present evidence is insufficient to support the generalizability of the reported study results.

Possible mediators of these outlier results include the possibility that these techniques are simply superior to others. The role of proprietary, personal, and financial interests, including business relationships between technique developers and principal investigators, cannot be overlooked, however, and the serious methodological and empirical confounds surrounding the supporting research undermine confidence in the study results and reported accuracy of these techniques. From a scientific perspective, even well-designed research generated by advocates of a method who have a vested interest in the outcome, and who act as participants and authors of the study report, does not have the compelling power of research not so encumbered by these potentially compromising factors.51

Regardless of what factors contribute to these exceptional results, the confounds associated with the supporting studies undermine confidence that they represent the true accuracies. Expectations that these outlier results will generalize to field settings should be delayed until more complete independent replication studies and extended analysis are completed.

Criterion accuracy

Excluding outliers, the aggregated unweighted accuracy52 of all PDD techniques was .869 (.798 - .940), with an unweighted inconclusive rate of .128 (.068 - .187). All 14 PDD techniques included in this meta-analysis produced unweighted decision accuracy levels that were significantly greater than chance. Excluding outliers, there were no significant one-way differences in the

51 Questions may arise as to why these studies and techniques were included in the meta-analysis after identifying so many serious confounds. The techniques were ultimately included in the meta-analysis because they met the more general requirements outlined in the APA Standards of Practice. It was also determined that the meta-analysis is more complete, and therefore more helpful and informative to interested readers, with the inclusion of these studies and techniques.

52 The unweighted average was considered to be a more conservative and realistic calculation of the overall accuracy of all PDD examination techniques. Calculation of the weighted average, or the simple proportion of correct decisions, often results in higher statistical findings that are less robust against differences in base rates and therefore less generalizable.


unweighted decision accuracy of any of the 14 PDD techniques, and no significant one-way differences in correct decisions, inconclusive results, or errors for criterion deceptive or criterion truthful cases. Neither were there any significant differences in the aggregated criterion accuracy of PDD techniques at the evidentiary, paired-testing, and investigative levels. Some practical differences were observed in the criterion accuracy profiles of these techniques. All five techniques included at the evidentiary level produced statistically significant effect sizes for both test sensitivity to deception and test specificity to truth-telling.

At the paired-testing level all five techniques also produced test sensitivity to deception that was significantly greater than chance, though only two of these techniques, the Federal ZCT scored with the seven-position evidentiary rules and the AFMGQT scored with the ESS, produced test specificity to truth-telling that was significantly greater than chance. Specificity to truth-telling was not significantly greater than chance for the Backster You-Phase technique with Backster scoring, the Federal ZCT with seven-position scoring, or the Federal You-Phase with seven-position scoring.

For investigative techniques, all four techniques produced test sensitivity to deception that was significantly greater than chance. Specificity to truth-telling was significantly greater than chance for the CIT and the DLST/TES format when scored with both the seven-position and ESS models, but was not significantly greater than chance for the AFMGQT when scored with the seven-position model.

Excluding outlier results, published and replicated empirical evidence exists for seven CQT formats intended for event-specific diagnostic testing for which the results are interpreted using decision rules based on the assumption of non-independence of the criterion variance of the RQs; these produced an aggregated unweighted accuracy rate of .890 (.829 - .951) along with an inconclusive rate of .110 (.047 - .173). These techniques are, in alphabetical order, the AFMGQT when scored with the ESS, the Backster You-Phase technique scored with the Backster numerical scoring system, the Federal You-Phase technique scored with the Federal seven-position TDA model, the Federal You-Phase technique scored with the ESS, the Federal ZCT scored with the Federal seven-position TDA model, the Federal ZCT scored with the seven-position TDA model and interpreted with evidentiary decision rules, the Utah ZCT (including PLT, DLT, and CPC-RCMP variants) scored with the Utah numerical scoring system, and any variant of an event-specific three-question ZCT scored with the ESS.

Published and replicated empirical evidence exists for four PDD techniques that are interpreted with decision rules based on the assumption of independent criterion variance among the RQs. These techniques produced an aggregated unweighted accuracy level of .850 (.773 - .926) with an inconclusive rate of .125 (.068 - .183). In alphabetical order, these techniques are: the DLST/TES scored with the seven-position TDA method, the DLST/TES scored with the ESS, the AFMGQT when scored with the seven-position TDA method, and the AFMGQT when scored with the ESS.

Despite these observed and practical differences, excluding outlier results, no significant differences were found for decision accuracy or inconclusive results among the PDD techniques that satisfy the requirements of the APA 2012 standards. Similarly, no significant differences were found for decision accuracy or inconclusive results for PDD techniques interpreted with the assumption of independence or non-independence among the RQs.

Not all techniques reviewed possessed sufficient empirical support to meet the APA standards for inclusion. Some named PDD techniques were found to lack any published evidence of support that could be used to calculate the sampling distributions, reliability, and criterion accuracy profiles needed for inclusion in this meta-analysis; these are listed in Appendix F. Appendix G provides a summary of published studies that could not be included in the meta-analysis. Appendix H contains a description of techniques for which there exists a single un-replicated study that met the requirements for inclusion in this meta-analysis. These techniques could not be included in the meta-analysis as the APA Standard requires a minimum of two

251 Polygraph, 2011, 40(4)


Validated Techniques

published studies. Appendix I lists those PDD techniques found to have published and replicated evidence of support, but the reported criterion accuracy did not satisfy the validity requirements of the APA 2012 standards.

PDD techniques that make use of the three-position TDA model are not included in the meta-analysis and are therefore not included in Table 1. The criterion accuracy profiles of PDD techniques that make use of the three-position TDA model are shown in Appendix I. The unweighted decision accuracies were significant for all of the techniques based on three-position TDA methods, but not equal for the deceptive and truthful cases. All techniques that employed three-position TDA methods consistently exceeded the 2012 limit for inconclusive decisions (20%). Because criterion accuracy rates for techniques with three-position TDA did not differ significantly from seven-position criterion accuracy, an initial analysis with the three-position TDA method may be considered acceptable if inconclusive results are resolved via subsequent analysis with a TDA method that provides both accuracy and inconclusive rates that meet the requirements of the APA 2012 standards.

Some readers will note that two versions exist for the AFMGQT, with minor structural differences between them.53 There is no evidence to suggest that the performance of one version is superior to the other. Considering that rigorous and replicated evidence would be required to reject a null hypothesis that the differences are meaningless, and considering that the included studies include a mixture of both AFMGQT versions, these results are provided as generalizable to both versions of the AFMGQT.

Two widely used and recognizable techniques, the LEPET and the Utah MGQT (four-question version of the Utah technique), were not included in the meta-analysis because no published studies could be located in support of these techniques. However, both of these PDD techniques are structurally nearly identical to the AFMGQT. We can find no reason why validation data for the AFMGQT cannot be generalized to these techniques if scored with the same TDA methods.

Comparison With Previous Systematic Reviews

We did not test the level of significance of the difference between the present accuracy estimations and those of the OTA (1983), though it can be easily seen that the .847 mean accuracy rate of field studies is outside the 95% confidence interval (.865 to .977) for PDD techniques that meet the APA 2012 requirements for evidentiary testing. This observed difference is most likely due to the exclusion of results from studies that do not conform to recognizable field practices, and to the exclusion of results of PDD techniques that do not produce satisfactory results according to the APA 2012 standards of practice. There is little justification for use in field practice, and therefore little justification for inclusion into accuracy estimations, of PDD techniques that have been supplanted by more effective methods. Inclusion of arcane or substandard methods into accuracy estimation would be the equivalent of attempting to answer an automobile industry question regarding corporate fuel economy while including all makes and models from the 1960s and 1970s gas-guzzling era into calculations of present-day economy. Techniques which produce substandard and unsatisfactory criterion accuracy were therefore excluded from the meta-analysis.

These results are consistent with the results of Honts and Peterson (1997), Raskin and Podlesny (1979), Abrams (1977; 1989), and Ansley's (1990) findings regarding blind evaluation of PDD test data. Results of this meta-analysis are also consistent with the results of the more recent National Research Council (2003), which reported an accuracy rate of .860 for laboratory studies along with an aggregated rate of .890 for field studies, using studies that met their selection criteria. Because the present analysis includes only

53 The AFMGQT is used in both multi-facet investigations of known incidents and multi-issue screening contexts. Both types of exams, multi-facet and multi-issue, are interpreted with decision rules based on the assumption of independent criterion variance among the RQs.
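The interval comparisons used throughout this discussion (for example, noting that the OTA's .847 falls outside the .865 to .977 interval) read the same way as any interval estimate. The committee's intervals were derived from the study data themselves, so the following is only a generic sketch of the standard normal-approximation interval for a proportion, using invented numbers rather than figures from the meta-analysis:

```python
from math import sqrt

def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion (generic sketch)."""
    se = sqrt(p_hat * (1.0 - p_hat) / n)
    return (p_hat - z * se, p_hat + z * se)

# Invented numbers for illustration -- not the committee's data.
low, high = proportion_ci(p_hat=0.921, n=500)
print(f"{low:.3f} - {high:.3f}")     # the 95% interval
print(0.847 < low or 0.847 > high)   # True: a comparison rate of .847 falls outside it
```

A comparison value outside the interval, as in the OTA comparison above, indicates a difference unlikely to be explained by sampling variability alone.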



Ad Hoc Committee on Validated Techniques

techniques as they are documented and used in field settings, we suggest that the present results provide a more helpful and practical answer to PDD professionals, program managers, and professional consumers of PDD results who are faced with the need to make evidence-based decisions regarding the selection and field use of presently available PDD techniques.

These results are more conservative than those reported by Ansley (1983; 1990) and those of Abrams (1973), which warrants further discussion. It is unlikely that the PDD test has become less accurate during the last three decades. A more realistic possibility is that the samples included in the early literature reviews by Ansley were more vulnerable to overestimation of test accuracy as a result of sample selection methodology. Ansley (1990) stated that court decisions and evidence are sometimes unreliable, and expressed a preference for confession confirmation of PDD examination results. Over-emphasis on confession confirmation includes the potential for unintended systematic exclusion of false-negative and false-positive errors, both of which are conditions unlikely to lead to a confession, and therefore to satisfaction of the confirmation criterion. Confessions themselves are the result of a non-random decision to pursue further discussion with and disclosure from the examinee. If the decision to pursue a confession is based in part on the results of a polygraph exam, then confirmation via confession is non-independent from the test result and therefore self-fulfilling.

The impact of these methodological issues could be sampling distributions that inflate PDD test accuracy.54 This phenomenon may not be limited to confirmation via confession; all field samples that are selected through the availability and quality of confirmation data are potentially non-random and non-representative. This same concern, regarding the non-independence of confirmation data, applies to investigation results and judicial outcomes that are based in part on the information resulting from a polygraph exam. It is possible that this phenomenon underlies the general trend in the literature in which the results of PDD field studies have generally outperformed the results of laboratory experiments.55 Despite potential or observed sampling differences between field and laboratory studies, the NRC (2003) found no significant differences between the results of high quality field and laboratory studies.

Results of this meta-analysis are consistent with the systematic review of Crewson (2003) regarding the accuracy of diagnostic polygraphs. However, these results depart from Crewson's conclusion regarding screening polygraphs. The accuracy rate found for screening polygraphs in this meta-analysis was higher than that reported by Crewson, and the difference is statistically significant (t-test, df = 1008, p = .002). While the exact cause of this difference cannot be known from the present data, we note that the studies and techniques used by Crewson could not be included in this meta-analysis. Four of the screening studies reported by Crewson involved the Relevant-Irrelevant technique (Ansley, 1989; Brownlie, Johnson & Knill, 1997; Honts & Amato, 1999; Jayne, 1989), and the remaining study involved the Reid Technique. Included studies pertaining to criterion independent screening polygraphs were not available at the time of Crewson's review of the published scientific literature.

54 A possible example of this phenomenon can be seen in Mangan et al. (2008), who reported the results of a survey of the confession-confirmed test results of one experienced examiner. The reported results were 100% accurate, a finding in accord with what would be expected to arise from a confession-based selection bias.

55 An alternative explanation would hold that the difference is the result of differences in ecological and external validity of the test circumstances. These hypotheses have not been thoroughly evaluated and it would be unwise to attempt to reach any conclusion with the current state of understanding. The NRC (2003) reported that this trend is not inconsistent with experience in other fields of testing and science and should be the focus of future research.
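The confession-based selection bias discussed above (and illustrated by footnote 54) is easy to demonstrate with a small simulation: when cases enter the confirmed sample mainly because a deceptive examinee confessed after a correct deception-indicated (DI) call, the confirmed subsample looks nearly perfect even though the underlying test is not. All probabilities below are invented for illustration; they are not estimates from this meta-analysis.

```python
import random

random.seed(1)

ACCURACY = 0.85            # assumed true decision accuracy (illustrative)
P_CONFESS_AFTER_DI = 0.40  # deceptive examinee confesses after a correct DI call
P_CONFESS_OTHERWISE = 0.02

n = correct = confirmed_n = confirmed_correct = 0
for _ in range(100_000):
    deceptive = random.random() < 0.5
    call_correct = random.random() < ACCURACY
    called_deceptive = deceptive if call_correct else not deceptive
    # A case enters the "confirmed" sample only if the examinee confesses,
    # which happens mostly after a correct DI call against a deceptive examinee.
    p_confess = P_CONFESS_AFTER_DI if (deceptive and called_deceptive) else P_CONFESS_OTHERWISE
    n += 1
    correct += call_correct
    if deceptive and random.random() < p_confess:
        confirmed_n += 1
        confirmed_correct += call_correct

print(f"overall accuracy:            {correct / n:.3f}")
print(f"confession-confirmed subset: {confirmed_correct / confirmed_n:.3f}")
```

The confession-confirmed subset approaches the near-perfect figures reported in confession-confirmed surveys even though the simulated test is, by construction, far from perfect; this is the inflation mechanism the text describes.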


Moderators and Mediators

There was no indication that the study results were a function of, or influenced by, sample sizes. Results were not coded for examinee characteristics, including age, gender, ethnicity, culture, education, or socio-economic status, nor were the studies coded for their quality or methodology. Sample results based on examinees who were subject to some form of experimental manipulation (e.g., medications, fatigue, chronic physical or chronic mental health problems, level of functioning, countermeasure training or instructions, etc.) were not included, and these factors were not evaluated.

Excluding outliers, no significant differences were found in the criterion accuracy of PDD techniques suitable for evidentiary testing, paired testing, and investigative testing. This suggests that these categorical distinctions are arbitrary and therefore meaningless in a scientific sense. However, the value of standardized requirements for test precision becomes clearer when considering policy decisions that emphasize or require the use of evidence-based methods and restrict the use of un-validated or experimental methods. The scientific value of categorical distinctions becomes more obvious when considering the difficulty in answering questions about scientific accuracy, and the complications that result from the inclusion into accuracy estimates of less accurate and arcane methods that have been supplanted or replaced by more effective modern alternatives. Ethically, it is difficult to imagine some justification, when decisions affect individual lives, community safety and national security, for the use of methods which the scientific evidence has shown to be sub-optimal or sub-standard.

Comparison of accuracy rates for PDD techniques interpreted with the assumption of criterion independence versus non-independence showed no significant differences in decision accuracy. However, a significant interaction effect for inconclusive results suggests there may be subtle differences in inconclusive rates for these types of exams. This would seem to suggest that the selection of different examination strategies, involving independent or non-independent RQs, is a practical matter that should be determined by the needs of the testing circumstances.

Ancillary Analysis

One ancillary analysis was completed. Results were calculated for CQT formats with the exclusion of those studies that did not satisfy a more rigorous set of selection criteria. First, PDD techniques were excluded from the ancillary analysis if test sensitivity to deception and test specificity to truth-telling were not both statistically significantly greater than chance. This resulted in the exclusion of the AFMGQT, Federal ZCT, and Federal You-Phase techniques when these are scored with the seven-position TDA model, in addition to the Backster You-Phase technique. Statistical outliers, not accounted for by the available evidence, were also excluded. This resulted in the exclusion of the IZCT and MQTZCT and several studies that were seriously confounded. Exclusion of techniques for which there are no published statistics describing test reliability also resulted in the exclusion of the IZCT. Similarly, PDD techniques and studies were excluded if there were significant interaction or main effect differences between the sampling distributions, indicating that the sample distributions are not representative of each other. This also resulted in the exclusion of the IZCT and MQTZCT. Studies were also excluded if statistical descriptions of the sampling distributions were not available or could not be calculated from the available data. This resulted in the removal of two studies on the DLST (Research Division Staff, 1995a; 1995b), one study on the AFMGQT (Senter, Waller & Krapohl, 2008), and two studies on the MQTZCT (Shurani, Stein & Brand, 2009; Shurani, 2011).

CQT formats retained for the ancillary analysis produced a combined decision accuracy rate of .898 (.840 - .955) and an inconclusive rate of .092 (.033 - .150)56 for PDD techniques interpreted with decision rules based on an assumption of

56 Calculated as the weighted average of unweighted decision accuracy and the unweighted inconclusive rate.
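Footnote 56 describes a weighted-average aggregation convention, and footnote 57 an unweighted one. One plausible reading of these two conventions can be sketched as follows; the per-study tuples are invented for illustration and are not data from the meta-analysis.

```python
# Invented per-study values: (sample size, decision accuracy, inconclusive rate).
# These are NOT figures from the meta-analysis.
studies = [(100, 0.90, 0.10), (60, 0.88, 0.08), (240, 0.91, 0.12)]

def unweighted_mean(values):
    # Each study counts equally, regardless of its sample size.
    return sum(values) / len(values)

def weighted_mean(values, weights):
    # Larger studies contribute proportionally more to the aggregate.
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

ns  = [n for n, _, _ in studies]
acc = [a for _, a, _ in studies]
inc = [i for _, _, i in studies]

print(round(unweighted_mean(acc), 3))    # unweighted convention (as in footnote 57)
print(round(weighted_mean(acc, ns), 3))  # sample-size-weighted convention
print(round(unweighted_mean(inc), 3))    # unweighted inconclusive rate
```

The two conventions diverge when study sizes differ greatly, which is one reason the report states which convention each aggregate uses.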


non-independence of the criterion variance of the RQs. PDD techniques interpreted with decision rules based on an assumption of independent criterion variance produced a decision accuracy rate of .857 (.782 - .932) and an inconclusive rate of .117 (.058 - .177). The aggregated decision accuracy rate for all studies and PDD techniques included in the ancillary analysis was .883 (.817 - .950) with an inconclusive rate of .116 (.056 - .175).57 Two-way ANOVAs showed that neither main effects nor interactions were significant when comparing the decision accuracy for the ancillary analysis with that of the entire meta-analysis. The interaction effect was significant for inconclusive results [F(1,1992) = 17.335, p < .001]. Inconclusive rates were slightly higher for truthful cases with non-independent techniques, and slightly higher for deceptive cases with criterion independent PDD techniques. Post-hoc ANOVAs showed that none of the one-way differences were significant, indicating that these small differences are unlikely to be noticed by field examiners.

One-way ANOVAs showed that the results of the ancillary analysis did not differ significantly from the results of the entire meta-analysis for correct decisions [F(1,5471) = 0.08, p = 0.777] or inconclusive results [F(1,5471) = 0.08, p = 0.777]. This indicates that the use of more rigorous study selection requirements would be unlikely to produce meta-analytic results that differ from the results of this study.

Limitations

Two obvious limitations pertain to this analysis. First, studies were not coded as field or laboratory studies, and no attempt was made to investigate any effects from differences in study design. Instead, field and laboratory results were included with equal consideration and the results of all studies were combined regardless of design. Secondly, there was no attempt to investigate decision accuracy at the level of the individual questions for any of the included PDD techniques. Related to this second confound is the fact that the results of studies involving the DLST/TES and AFMGQT PDD were achieved using decision rules that are based on an assumption of criterion independence among the RQs. Generalizability of the results of this meta-analysis may depend, in part, on the correctness of this assumption.

Some of the included studies are impaired by obvious research confounds, the most noticeable of which is that some samples were selected with an emphasis on examinee confession as a central feature of the criterion. Another important confound, observed in some of the included studies, was that the primary author was also the developer of a PDD technique for which there exists some form of proprietary or financial interest. Indeed, it would appear that one of the markers for these kinds of studies (and typical of advocacy research elsewhere) is that the reported near-perfect accuracy demonstrations are statistical outliers to the distribution of results from less confounded studies.

The absence of critical information and critical commentary in some included study reports gives the impression of a file-drawer bias in which less than favorable results are not submitted for publication. Another version of this problem seems to have occurred in the context of this meta-analysis, in which some of the study data initially provided to the committee, and some of the published sampling means, included only those results for which the scorers achieved the correct results, initially withholding the results of inconclusive and error cases. The result of this is that published sampling means for some studies are systematically devoid of error or uncontrolled variance and must therefore be considered not generalizable.

Confounds related to individual studies can complicate the meaning and interpretation of the results of the meta-analysis. These concerns represent an example of the value and need for scientific rigor and independence when evaluating the effectiveness of PDD and lie detection methods. Study selection and inclusion rules for the meta-analysis were intended to be as

57 Calculated as the unweighted average of all studies included in the ancillary analysis.
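For two groups, the one-way ANOVA comparisons reported above reduce to a single F statistic with one numerator degree of freedom (equivalently, the square of the two-sample t statistic). A minimal pure-Python sketch, using invented binary "correct decision" outcomes rather than the study data:

```python
def f_oneway_two_groups(a, b):
    """One-way ANOVA F statistic for two groups (pure-Python sketch)."""
    all_vals = a + b
    grand = sum(all_vals) / len(all_vals)
    means = [sum(g) / len(g) for g in (a, b)]
    # Between-groups mean square (df = k - 1 = 1 for two groups).
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip((a, b), means))
    ms_between = ss_between / 1
    # Within-groups mean square (df = N - k).
    ss_within = sum((x - m) ** 2 for g, m in zip((a, b), means) for x in g)
    ms_within = ss_within / (len(all_vals) - 2)
    return ms_between / ms_within

# Invented outcomes for two conditions (1 = correct decision, 0 = error).
group1 = [1] * 45 + [0] * 5   # 90% correct
group2 = [1] * 40 + [0] * 10  # 80% correct
print(round(f_oneway_two_groups(group1, group2), 3))
```

The resulting F would then be compared against the F distribution with (1, N - 2) degrees of freedom to obtain a p-value, as in the bracketed reports above.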


inclusive as possible, yet maintain a level of scientific rigor. To reduce the impact of these confounds on the meta-analysis, aggregated results have been provided both with and without outlier results.

Another confounding issue with some studies is that the level of education, training and knowledge regarding psychological, physiological and testing principles may be significantly greater for participating examiners than for most field examiners. Most blind scoring studies of PDD accuracy involve highly experienced experts. Studies of the ESS have been an exception to this trend, making use of inexperienced examiners and scorers, and recent studies by Honts and his colleagues have involved students trained to collect the study data.

Meta-analysis always involves the imposition of study selection rules, and it is always possible that a meta-analysis based on a different set of inclusion criteria would lead to different results. All studies in this analysis were regarded equally if they met the publication requirements and provided sufficient information to evaluate the criterion accuracy and reliability or generalizability of the study results. Qualitative requirements for inclusion in this study pertained only to whether included studies satisfactorily represented field testing instrumentation and components, and satisfactorily represented a PDD technique for which a published description exists for the test question sequence and method for test data analysis. Meta-analytic weighting values were assigned according to the sample size and number of scorers for each study, though there were no obvious effects related to sample size. Although previous statistical analyses have not identified any significant differences in the results of field and laboratory studies, it is possible that meta-analytic results would be slightly different if the included studies were coded and weighted for other dimensions, including study quality, design, sampling methodology, proprietary interests, or inclusion of the primary author as a study participant.

Another limitation of this analysis is that none of the included studies involved juvenile examinees. As a result, a conservative evaluation of the present results would suggest that our present knowledge base can be considered applicable only to physically and mentally healthy adults of normal functional characteristics. A more generous interpretation would recognize that there is little difference between adults and older juveniles in terms of physiology as measured or utilized by modern polygraph sensors, and little difference in the psychological bases for polygraph reactions between adults and older, developmentally mature, juveniles. Generalizing these results to persons who are known outliers compared to the expected distribution of persons from the normative population (i.e., persons whose functional characteristics are outside the normal range) should be done with great caution.

Some of the included studies lacked complete information, though it was possible to calculate the reliability, sampling distributions, and the dimensional profile of criterion accuracy from raw data that was provided to the ad hoc committee. Sample data were not available for some studies, notably those from the U.S. Government. Those studies include the development and validation studies on the TES (DLST) and the AFMGQT techniques. These studies did report reliability and accuracy data that was sufficient to include them in the meta-analysis, and all of these studies have been replicated independently.

An obvious limitation of this meta-analysis is that it did not include the results of computerized scoring algorithms.

One other issue deserves mention. The principal investigator for this meta-analysis was also the primary author of a number of included studies.58 The committee was aware that his research was significantly, and at times solely, involved in studies that were proffered to validate some of the included techniques. Some of these techniques

58 Mr. Nelson has no financial, proprietary or personal interest in any of the PDD techniques or methodologies included in this meta-analysis.


would not have been included without these studies. As such, it was the responsibility of the committee to weigh his judgments against factors that may have diminished his independence. This was of vital concern, inasmuch as a bias in study selection or data analysis would seriously compromise the integrity of the final report. Upon closer scrutiny, reservations regarding these two potentially conflicting roles (principal author of the meta-analysis and author of studies included in the meta-analysis) were mitigated by the lack of any apparent personal interest in the outcomes of those studies and his limited participatory role in any study (never having conducted or scored any of the examinations). His published or pending studies did not reveal any discernible pattern of preference for or against particular polygraph techniques. Finally, all decisions for study inclusion were made collectively among the committee members based on the merits of the research: no single committee member had absolute authority to exclude or include a given study. While reasonable individual opinions may differ in some parts of this report, the committee took deliberate care to ensure that personal interests among those in the committee would not be cause for criticism of the report. Full disclosure of the relationship between the principal author of this report and his research is provided here to meet the standard ethical obligation in scientific reports.

Recommendations

Because no significant differences were found among the 14 PDD techniques included in this meta-analysis, no attempt should be made to describe these techniques in terms of a rank order regarding effectiveness. Available evidence does not support any PDD technique as superior to others. Attempts at establishing any hierarchy of efficacy are therefore unwarranted. Instead, less attention should be given to named PDD techniques and meaningless differences in PDD test formats. More emphasis should be given to test construction details for which there is replicated evidence of their contribution to criterion accuracy. More emphasis should be given to the important practical and decision theoretic differences in PDD techniques for which the RQs are interpreted as independent or non-independent.

One practical area of needed research involves the generalizability of normative data and accuracy estimates for PDD techniques interpreted with the assumption of criterion independence, including both multi-facet and multi-issue exams. Another practical area of needed research involves the use of DLCs with additional PDD test formats.

Continued research and improvement is needed for all PDD techniques, and these improvements should be fully integrated into both training and field practices. Additional studies should be completed to increase the knowledge base regarding moderator variables such as examinee characteristics (e.g., juveniles, older persons, persons with mental illness, and persons with medical health complications) in addition to crime details or characteristics that lead to the most effective use of PDD examinations. Additional research is needed in screening examinations, including studies pertaining to the decision theoretic complexities inherent to examinations constructed with multiple independent targets. Researchers should continue to increase their use of Monte Carlo models and other statistical methods that can be used to provide answers to complex research problems that are difficult to investigate through other methods. Results of Monte Carlo studies should be compared to those from live experiments in both field and laboratory settings.

A number of mediator variables have been suggested as having a significant effect on the accuracy of the PDD exam, and some of these involve complex psychological and linguistic assumptions that may or may not be fully testable. Untestable hypotheses should be discarded in favor of testable ones, and additional research should be conducted to understand the merits of procedural and structural hypotheses that have been suggested as related to test accuracy.

Evidence from scientific studies should become a standing expectation, and developers and practitioners of PDD exams should resist the temptation to include authority-based and anecdotal theories which have not been tested.

Increased research standards are needed, including requirements for transparency and statements of interest from


all authors and participants. More importantly, because research on PDD test effectiveness is a process of testing the test, primary authors should be required to refrain from also functioning as a study participant. It is especially important, when the primary author is also the PDD technique developer or lacks independence due to a financial or business relationship with the developer, that the research data and methodology be subjected to rigorous objective and external review before a profession or the community is encouraged to rely on the research results.

Researchers should be required to provide statistical descriptions of the sampling distributions. This will facilitate more effective comparison of sampling distributions, and will increase the ability to evaluate and understand the representativeness and generalizability of study results. Just as the results of single un-replicated studies are of little actual value to meta-analytic research, the results of studies that employ a single expert scorer are of little actual value to the profession. Researchers should be encouraged or required to use multiple scorer participants of varied training and experience. Both examiners and examinees should be randomly selected whenever possible. This will increase the ability to study and understand the generalizability of PDD methods.

Some studies included in the meta-analysis are not adequately identified regarding the type of study, and the effect is potentially misleading for the profession. Pilot studies and surveys should be clearly identified as such, and should not be included in future systematic reviews or meta-analyses of criterion accuracy. Criterion studies should also be clearly distinguished from studies designed to evaluate moderator or mediator variables or questions of construct and causality.

Reliability statistics should be required for all studies, unless precluded by the study design (e.g., computer algorithm or simulation studies). Primary authors should be required to make all raw data and numerical scores available for review and extended analysis.

The results of computerized statistical TDA algorithms should be included in future studies of this type. The use of computers and statistical decision theory is still not common in TDA methods for PDD exams. Instead, TDA methods in field use emphasize manual scoring methods with integer-level and rank-level precision that should be considered blunt and unreliable compared to the precision and reliability that can be obtained via automated measurement and statistical analysis. There is a growing basis of evidence that indicates that computer algorithms can be equally or more effective than manual TDA methods as long as the data are of satisfactory quality. Because PDD examination results may play a decision support role in matters that affect individual lives, community safety and national security, the developers of computer algorithms should be required to provide complete descriptions of the operational procedures, in addition to the evaluation criteria, data transformation and aggregation methodology, normative data, and statistical basis in decision theory, signal detection theory, signal discrimination theory, regression analysis or machine learning.

Continued development and refinement of PDD testing methodologies is needed. PDD testing procedures have changed little over the past decade. Development efforts during this time have focused on improvements to test data analytic methods, including decision rules (Senter, 2003; Senter & Dollins, 2002; 2004; 2008), statistical algorithm development (Nelson, Krapohl & Handler, 2008), numerical transformations (Krapohl, 2010; Nelson, Krapohl & Handler, 2008; Nelson et al., 2011), and an increased use of normative data to calculate error rates and optimal decision cutscores (Krapohl, 2010; Nelson & Handler, 2010; Nelson, Krapohl & Handler, 2008; Nelson et al., 2011). Additional research and more detailed investigation and comparison of numerical transformation models is needed, including seven-position, three-position, ESS, and rank-order transformation methods, those of computer algorithms, and the application of these transformations to examinations constructed from independent and non-independent examination targets.

PDD component sensors have changed little for several decades. This may be a mixed blessing. Although critics may point to this as a stagnation of research and development, there is a considerable published knowledge


base supporting and describing the effectiveness of the presently used array of PDD component sensors. It would be premature to abandon that knowledge base in an attempt to satisfy a collective hunger for new methods. Any replacement of the presently used component sensors must be accompanied by published and replicated evidence that the data and information provided by the new sensors is as good as, or better than, the data and information from the presently used sensors. Additionally, the use of new and improved sensors will face a substantial and non-trivial burden of developing and demonstrating the incorporation of new data into new or existing normative data and new or existing structural decision models. Despite these general cautions about the replacement of PDD components and physiological measures, it is also clear that the PDD test remains imperfect and in need of continued advancement. PDD test methods will not be improved without replacing less effective methods and techniques with more effective ones.

To give greater confidence in the effectiveness of the IZCT and the MQTZCT (or any proprietary methods59), they should be subject to replication by independent researchers who did not develop the techniques, have no business relationship with the developers, and did not conduct the exams, analyze the data, or report the findings at the time of the exams. All raw data and numerical scores should be made available for extended analysis. If these techniques are inherently superior to others, there should be no great difficulty in confirming this through high-quality independent research.

Finally, this meta-analysis should be repeated at some future time, with the inclusion of new and emerging information. Future meta-analytic studies should code and evaluate for moderators such as examiner characteristics and examinee characteristics, and mediators such as study quality, financial interests, and other possible mediators.

Conclusions

Results of this meta-analysis show that a number of studies are of satisfactorily rigorous quality to provide a basis of empirical support describing the generalizability of an array of PDD techniques at criterion accuracy levels that significantly exceed chance expectations. Although somewhat arbitrary, the APA 2012 standards of practice and requirements for test accuracy are helpful to the profession. The goal of professional standards is to promote the use of effective methods and discourage the use of less effective and unproven methods. Fourteen PDD techniques were found to be supported by multiple published studies and to satisfy the requirements of the APA 2012 Standards of Practice. Normative data are available for each of these 14 PDD techniques. We note that despite the imperfections of the polygraph, the NRC (2003) reported that none of the potential new technologies was ready to replace the polygraph, and this condition appears not to have changed at the present time.

APA standards do not themselves impose qualitative or methodological requirements for scientific evidence,60 and no quantitative requirements are stated beyond the requirement for two publications that indicate certain levels of precision for examination decisions. Herein rests a potential weakness of the APA standards: a simplistic interpretation of the requirements suggests that anyone could press ink onto

59 In fairness to the developers of these methods, every "lie detection" method in the past 100 years that was researched by the developer, or by an enthusiastic user evaluating his or her own examinations, has reported accuracy approaching perfection. It is one of the hallmarks of advocacy research in all fields, not just lie detection. The trend includes Marston's discontinuous blood pressure technique, Summers' Pathometer in the 1930s, MacNitt with the Relevant-Irrelevant technique in the 1940s, Lykken's GKT, Farwell's "Brain Fingerprinting," and the Computer Voice Stress Analyzer. In each case the authors reported stellar accuracy, usually greater than 99%. In all these cases, however, subsequent research was either absent or resulted in accuracies significantly lower than the original reports.

60 The APA has adopted a standard for research, which can be found online at www.polygraph.org and printed in the journal Polygraph.


paper two times in self-published volumes, claiming perfect or near perfect accuracy, and subsequently claim compliance with the APA standards for test validation at the highest categorical level – for evidentiary testing. While this would be viewed by scientific thinkers with some skepticism, the example illustrates the need in meta-analytic research for the definition of more rigorous study inclusion and exclusion criteria. Just as not all evidence is good evidence, not all publications are useful. Publication itself is not an endorsement of fact, and merely indicates that editors and reviewers agreed that the work would be of some interest to the profession.

Nothing in this document should be taken to suggest that we presently know everything we need to know, or everything there is to know, about PDD testing. There is always more to learn and there is always a need for continued research. More information will undoubtedly become available in the future, and it is incumbent on professionals to continue to incorporate practices based on new and improving evidence from high quality scientific studies. To do otherwise is to subject the future of the profession to opinion, which will be vulnerable to personalities, politics, and personal interests. In the strictest sense, PDD techniques for which there is an inadequate basis of published and replicated scientific studies must be considered experimental, regardless of how long they have existed. However, the suggestion of abandoning unvalidated, ineffective, or experimental methods that have long been used in field practice is not without controversy.

It can be, and has been, argued that some unstudied or experimental methods may work as well as or better than some proven methods, and may provide specialized benefits in certain ways. Conversely, it is also possible that experimental and unproven methods do not work as well as those with evidence of scientific validity. In the worst circumstances, the use of experimental methods could result in otherwise-avoidable adverse consequences. A conservative assessment would suggest that the practice of conducting experimental methods on the public, when effective evidence-based methods are available and easily implemented with no additional costs, may be considered reasonable under some circumstances, but calls for compelling justification and includes ethical requirements for informed consent and notification.

Finally, this meta-analysis should be considered an information resource only, and the results of this study should not be interpreted as APA policy. No attempt should be made to represent or interpret this document or the results of this study as the only or final authority on PDD test validation. Other, equally reasonable, approaches are also possible regarding the evaluation of the scientific literature on PDD testing.

This project was completed with the goal of summarizing the existing published scientific literature regarding PDD techniques and criterion accuracy, and to provide a convenient resource to those who may wish to avoid the burden of reviewing the research literature for themselves. Every effort has been extended to not only provide conclusions, but also in-depth explanations that underlie those conclusions, so that readers can better understand their basis.

The information herein is provided to the APA Board to advise its professional membership of the strength of validation of PDD techniques in present use. This information is intended only to help PDD professionals make informed decisions regarding the selection of PDD techniques for use in field settings. It may also assist program administrators, policy makers, and courts to make evidence-based decisions about the informational value of PDD test results in general. Nothing should prevent the use of any PDD technique that is supported by scientific research that demonstrates an accuracy rate significantly greater than chance, so long as that use is compliant with the requirements of local laws, regulations, and enforceable standards.
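The closing criterion, an accuracy rate "significantly greater than chance," can be made concrete with a small computation. Below is a minimal sketch in Python using hypothetical counts (not figures from any included study): it computes the unweighted decision accuracy tabulated in Appendix B (the mean of sensitivity and specificity among non-inconclusive results) and a one-sided exact binomial test of the correct-decision count against the 50% chance expectation.

```python
from math import comb

def unweighted_accuracy(correct_d, decided_d, correct_t, decided_t):
    """Mean of sensitivity and specificity on decided (non-inconclusive)
    cases, so unequal group sizes do not skew the estimate."""
    return 0.5 * (correct_d / decided_d + correct_t / decided_t)

def p_at_least(k, n, p=0.5):
    """One-sided exact binomial probability of k or more correct in n trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical counts: 40 of 48 deceptive and 41 of 50 truthful cases
# classified correctly after excluding inconclusive results.
acc = unweighted_accuracy(40, 48, 41, 50)   # 0.827 unweighted accuracy
p = p_at_least(40 + 41, 48 + 50)            # 81 correct decisions of 98
print(round(acc, 3), p < 0.001)
```

A result like this would satisfy the "significantly greater than chance" threshold; whether it also meets the APA's categorical accuracy requirements is a separate question answered by the standards themselves.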

How to cite this document:

American Polygraph Association (2011). Meta-analytic survey of criterion accuracy of validated polygraph techniques. [Electronic version] Retrieved <DATE>, from http://www.polygraph.org.


References

* indicates studies that were included in the meta-analysis.

√ indicates studies that were cited only in the appendices.

Abrams, S. (1973). Polygraph validity and reliability: A review. Journal of Forensic Sciences, 18,
313-326.

Abrams, S. (1977). A polygraph handbook for attorneys. Lexington, MA: Lexington Books.

Abrams, S. (1984). The question of the intent question. Polygraph, 13, 326-332.

Abrams, S. (1989). The complete polygraph handbook. Lexington, MA: Lexington Books.

Anderson, C. A., Lindsay, J. J., & Bushman, B. J. (1999). Research in the psychological
laboratory: Truth or triviality? Current Directions In Psychological Science, 8, 3-9.

Ansley, N. (1983). A compendium on polygraph validity. Polygraph, 12, 53-61.

Ansley, N. (1989). Accuracy and utility of RI screening by student examiners at DODPI. Polygraph
and Personnel Security Research. Office of Security. National Security Agency. Fort
George G. Meade, MD.

Ansley, N. (1990). The validity and reliability of polygraph decisions in real cases. Polygraph, 19,
169-181.

Ansley, N. (1992). The history and accuracy of guilty knowledge and peak of tension tests.
Polygraph, 21, 174-247.

Backster, C. (1963). Standardized polygraph notepack and technique guide: Backster zone
comparison technique. Cleve Backster: New York.

Backster School of Lie Detection (2011). Basic polygraph examiner's course chart interpretation
notebook. Backster School of Lie Detection: San Diego.

Barland, G. H., Honts, C. R., & Barger, S. D. (1989). Studies of the accuracy of security screening
polygraph examinations. Department of Defense Polygraph Institute.

Barland, G. H. & Raskin, D. C. (1975). Psychopathy and detection of deception in criminal
suspects. Psychophysiology, 12, 224.

√Bell, B. G., Kircher, J. C., & Bernhardt, P. C. (2008). New measures improve the accuracy of the
directed-lie test when detecting deception using a mock crime. Physiology and Behavior,
94, 331-340.

√Bell, B. G., Raskin, D. C., Honts, C. R., & Kircher, J. C. (1999). The Utah numerical scoring
system. Polygraph, 28(1), 1-9.

Blackstone, K. (2011). Polygraph, Sex Offenders, and the Court: What Professionals Should Know
About Polygraph..., and a Lot More. Concord, MA: Emerson Books.


*Blackwell, J. N. (1998). PolyScore 3.3 and psychophysiological detection of deception examiner
rates of accuracy when scoring examinations from actual criminal investigations. Available at
the Defense Technical Information Center. DTIC AD Number A355504/PAA. Reprinted in
Polygraph, 28(2), 149-175.

*Blalock, B., Cushman, B., & Nelson, R. (2009). A replication and validation study on an
empirically based manual scoring system. Polygraph, 38, 281-288.

Blalock, B., Nelson, R., Handler, M., & Shaw, P. (2011). A position paper on the use of directed lie
comparison questions in diagnostic and screening polygraphs. Police Polygraph Digest, 2-5.

Brownlie, C., Johnson, G. J., & Knill, B. (1997) Validation study of the relevant/irrelevant
screening format. Unpublished report.

Capps, M. H. (1991). Predictive value of the sacrifice relevant. Polygraph, 20(1), 1-8.

√Capps, M. H. & Ansley, N. (1992). Comparison of two scoring scales. Polygraph, 21, 39-43.

Capps, M. H., Knill, B. L., & Evans, R. K. (1993). Effectiveness of the symptomatic questions.
Polygraph, 22, 285-298.

√Correa, E. J. & Adams, H. E. (1981). The validity of the pre-employment polygraph examination
and the effects of motivation. Polygraph, 10, 143-155.

Crewson, P. E. (2001). A comparative analysis of polygraph with other screening and diagnostic
tools. Research Support Service. Report No. DoDPI01-R-0003. Reprinted in Polygraph, 32,
57-85.

Department of Defense (2006). Federal psychophysiological detection of deception examiner
handbook. Reprinted in Polygraph, 40(1), 2-66.

*Driscoll, L. N., Honts, C. R., & Jones, D. (1987). The validity of the positive control physiological
detection of deception technique. Journal of Police Science and Administration, 15, 46-50.
Reprinted in Polygraph, 16(3), 218-225.

√Forman, R. F. & McCauley, C. (1986). Validity of the positive control polygraph test using the
field practice model. Journal of Applied Psychology, 71, 691-698. Reprinted in Polygraph,
16(2), 145-160.

√Ganguly, A. K., Lahri, S. K., & Bhaseen, V. (1986). Detection of deception by conventional
qualitative method and its confirmation by quantitative method - An experimental study in
polygraphy. Polygraph, 15, 203-210.

√Ginton, A., Daie, N., Elaad, E., & Ben-Shakhar, G. (1982). A method for evaluating the use of
the polygraph in a real-life situation. Journal of Applied Psychology, 67, 131-137.

√Gordon, N. J. (1999). The academy for scientific investigative training's horizontal scoring
system and examiner's algorithm system for chart interpretation. Polygraph, 28, 56-64.

√Gordon, N. J., Fleisher, W. L., Morsie, H., Habib, W., & Salah, K. (2000). A field validity study of
the integrated zone comparison technique. Polygraph, 29, 220-225.


*Gordon, N. J., Mohamed, F. B., Faro, S. H., Platek, S. M., Ahmad, H., & Williams, J. M. (2005).
Integrated zone comparison polygraph technique accuracy with scoring algorithms.
Physiology & Behavior, 87(2), 251-254. (Same study is described in Mohamed, F. B., Faro,
S. H., Gordon, N. J., Platek, S. M., Ahmad, H. & Williams, J. M. (2006).)

√Handler, M. (2006). The Utah PLC. Polygraph, 35, 139-148.

Handler, M. & Nelson, R. (2008). Utah approach to comparison question polygraph testing.
European Polygraph, 2, 83-119.

√Handler, M. & Nelson, R. (In press). Criterion validity of the United States Air Force Modified
General Question Technique and three position scoring. Polygraph.

*Handler, M., Nelson, R., Goodson, W., & Hicks, M. (2010). Empirical Scoring System: A cross-
cultural replication and extension study of manual scoring and decision policies.
Polygraph, 39, 200-215.

√Harwell, E. M. (2000). A comparison of 3- and 7-position scoring scales with field examinations.
Polygraph, 29, 195-197.

Hilliard, D. L. (1979). A cross analysis between relevant questions and a generalized intent to
answer truthfully question. Polygraph, 8, 73-77.

*Honts, C. R. (1996). Criterion development and validity of the CQT in field application. The
Journal of General Psychology, 123, 309-324.

Honts, C. R., & Amato, S. L. (1999). The automated polygraph examination: Final report of U. S.
Government Contract No. 110224-1998-MO. Boise State University.

*Honts, C. R., Amato, S. & Gordon, A. (2004). Effects of outside issues on the comparison
question test. Journal of General Psychology, 131(1), 53-74.

√Honts, C. R. & Driscoll, L. N. (1987). An evaluation of the reliability and validity of rank order
and standard numerical scoring of polygraph charts. Polygraph, 16, 241-257.

√Honts, C. R. & Hodes, R. L. (1983). The detection of physical countermeasures. Polygraph, 12,
7-17.

*Honts, C. R., Hodes, R. L., & Raskin, D. C. (1985). Effects of physical countermeasures on the
physiological detection of deception. Journal of Applied Psychology, 70(1), 177-187.

Honts, C. R. & Peterson, C. F. (1997). Brief of the Committee of Concerned Social Scientists as
Amicus Curiae United States v Scheffer. Available from the author.

*Honts, C. R. & Raskin, D. (1988). A field study of the validity of the directed lie control question.
Journal of Police Science and Administration, 16(1), 56-61.

*Honts, C. R., Raskin, D. C., & Kircher, J. C. (1987). Effects of physical countermeasures and
their electromyographic detection during polygraph tests for deception. Psychophysiology,
1, 241-247.

Honts, C. R., & Reavy, R. (2009). Effects of Comparison Question Type and Between Test
Stimulation on the Validity of Comparison Question Test. US Army Research Office: Grant
Number W911NF-07-1-0670.


*Horowitz, S. W., Kircher, J. C., Honts, C. R., & Raskin, D. C. (1997). The role of comparison
questions in physiological detection of deception. Psychophysiology, 34, 108-115.

√Horvath, F. S. (1977). The effect of selected variables on interpretation of polygraph records.
Journal of Applied Psychology, 62, 127-136.

√Horvath F. S. (1988). The utility of control questions and the effects of two control question
types in field polygraph techniques. Journal of Police Science and Administration, 16(3),
198-209. Reprinted in Polygraph, 20, 7-25.

Horvath, F. S. (1994). The value and effectiveness of the sacrifice relevant question: An empirical
assessment. Polygraph, 23, 261-279.

√Horvath, F. & Palmatier, J. (2008). Effect of two types of control questions and two question
formats on the outcomes of polygraph examinations. Journal of Forensic Sciences, 53(4), 1-
11.

√Horvath, F. S. & Reid, J. E. (1971). The reliability of polygraph examiner diagnosis of truth and
deception. Journal of Criminal Law, Criminology and Police Science, 62, 276-281.

√Hunter, F. L., & Ash, P. (1973). The accuracy and consistency of polygraph examiners'
diagnosis. Journal of Police Science and Administration, 1, 370-375.

Jayne, B. (1989). A comparison between the predictive value of two common preemployment
screening procedures. The Investigator, 5(3).

√Jayne, B. C. (1990). Contributions of physiological recordings in the polygraph technique.
Polygraph, 19, 105-117.

Kircher, J. C., Kristjiansson, S. D., Gardner, M. K., & Webb, A. (2005). Human and computer
decision-making in the psychophysiological detection of deception. University of Utah.

*Kircher, J. C. & Raskin, D. C. (1988). Human versus computerized evaluations of polygraph data
in a laboratory setting. Journal of Applied Psychology, 73, 291-302.

Kokish, R., Levenson, J. S., & Blasingame, G. D. (2005). Post-conviction sex offender polygraph
examination: Client-reported perceptions of utility and accuracy. Sexual Abuse: A Journal
of Research and Treatment, 17, 211-221.

√Krapohl, D. J. (1998). A comparison of 3- and 7- position scoring scales with laboratory data.
Polygraph, 27, 210-218.

Krapohl, D. J. (2002). Short report: Update for the objective scoring system. Polygraph, 31, 298-
302.

√Krapohl, D. J. (2005). Polygraph decision rules for evidentiary and paired testing (Marin
protocol) applications. Polygraph, 34, 184-192.

Krapohl, D. J. (2006). Validated polygraph techniques. Polygraph, 35(3), 149-155.

Krapohl, D. J. (2010). Short report: A test of the ESS with two-question field cases. Polygraph,
39, 124-126.


*Krapohl, D. J. & Cushman, B. (2006). Comparison of evidentiary and investigative decision
rules: A replication. Polygraph, 35(1), 55-63.

√Krapohl, D. J., Dutton, D. W. & Ryan, A. H. (2001). The rank order scoring system: Replication
and extension with field data. Polygraph, 30, 172-181.

Krapohl, D. J. & McManus, B. (1999). An objective method for manually scoring polygraph data.
Polygraph, 28, 209-222.

√Krapohl, D. J. & Norris, W. F. (2000). An exploratory study of traditional and objective scoring
systems with MGQT field cases. Polygraph, 29, 185-194.

Krapohl, D. J. & Ryan, A. H. (2001). A belated look at symptomatic questions. Polygraph, 30,
206-212.

√Krapohl, D. J., Senter, S. M., & Stern, B. A. (2005). An exploration of methods for the analysis of
multiple-issue Relevant/Irrelevant screening data. Polygraph, 34(1), 47-62.

Lykken, D. T. (1959). The GSR in the detection of guilt. Journal of Applied Psychology, 43, 385-
388.

*MacLaren, V. V. (2001). A quantitative review of the guilty knowledge test. The Journal of
Applied Psychology, 86, 674-683.

*Mangan, D. J., Armitage, T. E., & Adams, G. C. (2008). A field study on the validity of the
Quadri-Track Zone Comparison Technique. Physiology and Behavior, 17-23.

√Matte, J. A. (1990). Validation study on the polygraph Quadri-Zone Comparison Technique.
Research Abstract LD 01452, Vol. 1502, 1989, University Microfilm International (UMI),
Ann Arbor, MI.

√Matte, J. A. (2010). A field study of the Backster Zone Comparison Technique's Either Or Rule
and scoring system versus two other scoring systems when relevant question elicits strong
response. European Polygraph, 4, 53-69.

*Matte, J. A. & Reuss, R. M. (1989). A field validation study of the Quadri-Zone Comparison
Technique. Polygraph, 18, 187-202.

√Meiron, E., Krapohl, D. J., & Ashkenazi, T. (2008). An assessment of the Backster “Either-Or”
Rule in polygraph scoring. Polygraph, 37, 240-249.

Mohamed, F. B., Faro, S. H., Gordon, N. J., Platek, S. M., Ahmad, H., & Williams, J. M. (2006).
Brain mapping of deception and truth telling about an ecologically valid situation:
functional MR imaging and polygraph investigation--initial experience. Radiology, 238,
679-88.

National Research Council (2003). The Polygraph and Lie Detection. Washington, D.C.: National
Academy of Sciences.

*Nelson, R. (In press). Monte Carlo study of criterion validity of Backster You-Phase
examinations. Polygraph.


*Nelson, R. (In press). Monte Carlo study of criterion validity of the Directed Lie Screening Test
using the seven-position, three-position and Empirical Scoring Systems. Polygraph.

*Nelson, R. (2011). Monte Carlo study of criterion validity for two-question zone comparison tests
with the Empirical Scoring System, seven-position, and three-position scoring models.
Polygraph, 40, 146-156.

*Nelson, R. & Blalock, B. (In press). Extended analysis of Senter, Waller and Krapohl's AFMGQT
examination data with the Empirical Scoring System and the Objective Scoring System,
version 3. Polygraph, (In press).

*Nelson, R., Blalock, B., & Handler, M. (2011). Criterion validity of the Empirical Scoring System
and the Objective Scoring System, version 3 with the USAF Modified General Question
Technique. Polygraph, 40, 172-179.

*Nelson, R., Blalock, B., Oelrich, M., & Cushman, B. (2011). Reliability of the Empirical Scoring
System with expert examiners. Polygraph, 40, 131-139.

Nelson, R. & Handler, M. (2010). Empirical Scoring System. Lafayette Instrument Company.

*Nelson, R. & Handler, M. (In press). Monte Carlo study of the United States Air Force Modified
General Question Technique with two three and four questions. Polygraph.

*Nelson, R., Handler, M., Adams, G., & Backster, C. (In press). Survey of reliability and criterion
validity of Backster numerical scores of You-Phase exams from confirmed field
investigations. Polygraph.

*Nelson, R., Handler, M., Blalock, B., & Cushman, B. (In press). Blind scoring of confirmed
federal You-Phase examinations by experienced and inexperienced examiners: Criterion
validity with the Empirical Scoring System and the seven-position model. Polygraph.

*Nelson, R., Handler, M., Blalock, B., & Hernández, N. (In press). Replication and extension study
of Directed Lie Screening Tests: Criterion validity with the seven- and three-position models
and the Empirical Scoring System. Polygraph.

*Nelson, R., Handler, M., & Morgan, C. (In press). Criterion validity of the Directed Lie Screening
Test and the Empirical Scoring System with inexperienced examiners and non-naive
examinees in a laboratory setting. Polygraph.

*Nelson, R., Handler, M., Morgan, C., & O’Burke, P. (In press). Criterion validity of the United
States Air Force Modified General Question Technique and Iraqi scorers. Polygraph.

*Nelson, R., Handler, M., & Senter, S. (In press). Monte Carlo study of criterion validity of the
Directed Lie Screening Test using the Empirical Scoring System and the Objective Scoring
System version 3. Polygraph.

*Nelson, R., Handler, M., Shaw, P., Gougler, M., Blalock, B., Russell, C., Cushman, B., & Oelrich,
M. (2011). Using the Empirical Scoring System. Polygraph, 40(2), 67-78.

*Nelson, R. & Krapohl, D. (2011). Criterion validity of the Empirical Scoring System with
experienced examiners: Comparison with the seven-position evidentiary model using the
Federal Zone Comparison Technique. Polygraph, 40, 79-85.


*Nelson, R., Krapohl, D., & Handler, M. (2008). Brute force comparison: A Monte Carlo study of
the Objective Scoring System version 3 (OSS-3) and human polygraph scorers. Polygraph,
37, 185-215.

Office of Technology Assessment (1983). The validity of polygraph testing: A research review and
evaluation. Washington, D.C.: U.S. Congress, Office of Technology Assessment.

√Patrick, C. J. & Iacono, W. G. (1989). Psychopathy, threat and polygraph test accuracy. Journal
of Applied Psychology, 74, 347-355.

√Patrick, C. J. & Iacono, W. G. (1991). Validity of the control question polygraph test: The
problem of sampling bias. Journal of Applied Psychology, 76, 229-238.

Podlesny, J. A. & Raskin, D. C. (1978). Effectiveness of techniques and physiological measures in
the detection of deception. Psychophysiology, 15, 344-359.

Podlesny, J., Raskin, D., & Barland, G. (1976). Effectiveness of Techniques and Physiological
Measures in the Detection of Deception. Report No. 76-5, Contract 75-N1-99-001 LEAA
(available through Department of Psychology, University of Utah, Salt Lake City).

Podlesny, J. A. & Truslow, C. M. (1993). Validity of an expanded-issue (modified general question)
polygraph technique in a simulated distributed-crime-roles context. Journal of Applied
Psychology, 78, 788-797.

Pollina, D. A., Dollins, A. B., Senter, S. M., Krapohl, D. J. & Ryan, A. H. (2004). Comparison of
polygraph data obtained from individuals involved in mock crimes and actual criminal
investigations. Journal of Applied Psychology, 89, 1099-1105.

√Raskin, D. C. & Hare, R. D. (1978). Psychopathy and detection of deception in a prison
population. Psychophysiology, 15, 126-136.

Raskin, D. C. & Honts, C. R. (2002). The comparison question test. In M. Kleiner (Ed.),
Handbook of Polygraph Testing. San Diego: Academic Press.

Raskin, D. C. & Podlesny, J. A. (1979). Truth and deception: A reply to Lykken. Psychological
Bulletin, 86, 54-59.

Reid, J. E. (1947). A revised questioning technique in lie detection tests. Journal of Criminal Law
and Criminology, 37, 542-547. Reprinted in Polygraph 11, 17-21.

√Reid, J. E. & Inbau, F. E. (1977). Truth and deception: The polygraph ('lie detector') technique
(2nd ed). Baltimore, MD: Williams & Wilkins.

*Research Division Staff (1995a). A comparison of psychophysiological detection of deception
accuracy rates obtained using the counterintelligence scope polygraph and the test for
espionage and sabotage question formats. DTIC AD Number A319333. Department of
Defense Polygraph Institute. Fort Jackson, SC. Reprinted in Polygraph, 26(2), 79-106.

*Research Division Staff (1995b). Psychophysiological detection of deception accuracy rates
obtained using the test for espionage and sabotage. DTIC AD Number A330774.
Department of Defense Polygraph Institute. Fort Jackson, SC. Reprinted in Polygraph,
27(3), 171-180.


√Research Division Staff (2001). Test of a mock theft scenario for use in the Psychophysiological
Detection of Deception: IV. Report No. DoDPI00-R-0002. Department of Defense Polygraph
Institute. Reprinted in Polygraph 30(4) 244-253.

√Rovner, L. I. (1986). Accuracy of physiological detection of deception for subjects with prior
knowledge. Polygraph, 15(1), 1-39.

Senter, S. M. (2003). Modified general question test decision rule exploration. Polygraph, 32, 251-
263.

Senter, S. M. & Dollins, A. B. (2002). New Decision Rule Development: Exploration of a two-stage
approach. Report number DoDPI00-R-0001. Department of Defense Polygraph Institute
Research Division, Fort Jackson, SC. Reprinted in Polygraph 37, 149-164.

Senter, S. & Dollins, A. B. (2004). Comparison of question series and decision rules: A replication.
Polygraph, 33, 223-233.

Senter, S. M. & Dollins, A. B. (2008). Optimal decision rules for evaluating psychophysiological
detection of deception data: an exploration. Polygraph, 37(2), 112-124.

*Senter, S., Waller, J., & Krapohl, D. (2008). Air Force Modified General Question Test validation
study. Polygraph, 37(3), 174-184.

Senter, S., Weatherman, D., Krapohl, D., & Horvath, F. (2010). Psychological set or differential
salience: A proposal for reconciling theory and terminology in polygraph testing. Polygraph,
39 (2), 109-117.

*Shurani, T. (2011). Polygraph verification test. European Polygraph, 16.

*Shurani, T. & Chaves, F. (2010). Integrated Zone Comparison Technique and ASIT PolySuite
algorithm: A field validity study. European Polygraph, 4(2), 71-80.

*Shurani, T., Stein, E. & Brand, E. (2009). A Field Study on the Validity of the Quadri-Track Zone
Comparison Technique. European Polygraph, 1, 5-24.

√Slowik, S. M. & Buckley, J. P., III (1975). Relative accuracy of polygraph examiner diagnosis of
respiration, blood pressure and GSR recordings. Journal of Police Science and
Administration, 3, 305-309.

√Van Herk, M. (1990). Numerical evaluation: Seven point scale +/-6 and possible alternatives: A
discussion. The Newsletter of the Canadian Association of Police Polygraphists, 7, 28-47.
Reprinted in Polygraph, 20(2), 70-79.

Verschuere, B., Meijer, E., & Merckelbach, H. (2008). The Quadri-Track Zone Comparison
Technique: It's just not science. A critique to Mangan, Armitage, and Adams (2008).
Physiology and Behavior, 1-2, 27-28.

√Wicklander, D. E. & Hunter, F. L. (1975). The influence of auxiliary sources of information in
polygraph diagnosis. Journal of Police Science and Administration, 3, 405-409.


Appendix A
Sample Sizes of Included Studies

Columns: PDD Technique; Study; N (total / deceptive / truthful); Scores (total / deceptive / truthful); Scorers

AFMGQT (7-position) Senter, Waller & Krapohl (2008) 1 69 33 36 69 33 36 1
AFMGQT (7-position) Nelson, Handler, Morgan & O'Burke (In press) 2 22 11 11 66 33 33 3
AFMGQT (7-position) Nelson, Handler, & Senter (In press) 3 A - - - 100 50 50 1
AFMGQT (ESS) Nelson, Blalock & Handler (2011) 2 - - - 66 33 33 3
AFMGQT (ESS) Nelson & Blalock (In press) 1 - - - 69 33 36 1
AFMGQT (ESS) Nelson, Handler, & Senter (In press) 3 A 100 50 50 100 50 50 1
Backster You-Phase (Backster) Nelson, Handler, Adams & Backster (In press) 4 22 11 11 154 77 77 7
Backster You-Phase (Backster) Nelson (In press) 100 50 50 100 50 50 1
CIT MacLaren 2001 1,070 666 404 1,070 666 404 39
DLST/TES (7-position) Research Division Staff 1995a 94 26 68 94 26 68 3
DLST/TES (7-position) Research Division Staff 1995b 85 30 55 85 30 55 10
DLST/TES (7-position) Nelson (In press) B 100 50 50 100 50 50 1
DLST/TES (7-position) Nelson Handler Blalock & Hernández (In press) 5 C 49 25 24 98 50 48 2
DLST/TES (ESS) Nelson & Handler (In press) 100 50 50 100 50 50 1
DLST/TES (ESS) Nelson, Handler & Morgan (In press) 49 24 25 49 24 25 1
DLST/TES (ESS) Nelson (In press) B - - - 100 50 50 1
DLST/TES (ESS) Nelson, Handler, Blalock & Hernández (In press) 5 C - - - 98 50 48 2
Federal You-Phase (7-position) Nelson (2011) D 100 50 50 100 50 50 1
Federal You-Phase (7-position) Nelson, Handler, Blalock & Cushman (In press) 6 E - - - 220 110 110 10
Federal You-Phase (ESS) Nelson (2011) D 100 50 50 100 50 50 1
Federal You-Phase (ESS) Nelson, Handler, Blalock & Cushman (In press) 6 E - - - 220 110 110 10
Federal ZCT (7-position) Blackwell (1998) 100 65 35 300 195 105 3
Federal ZCT (7-position) Krapohl & Cushman (2006) 7 F 100 50 50 1,000 500 500 10
Federal ZCT (7-position) Honts, Amato & Gordon (2004), as reported in Honts in Granhag (2004) 48 24 24 144 72 72 3
Federal ZCT (7-position evidentiary) Krapohl & Cushman (2006) 7 F - - - 1,000 500 500 10
Federal ZCT (7-position evidentiary) Nelson & Krapohl (2011) 8 G 60 30 30 60 30 30 6
IZCT (Horizontal) Shurani & Chavez (2010) 84 44 40 84 44 40 4
IZCT (Horizontal) Gordon, Mohamed, Faro, Platek, Ahmad & Williams (2005) 11 6 5 11 6 5 1
IZCT (Horizontal) Shurani (2011) 84 36 48 84 36 48 3
MQTZCT (Matte) Matte & Reuss (1989) dissertation 122 64 58 122 64 58 2
MQTZCT (Matte) Shurani, Stein & Brand (2009) 57 28 29 57 28 29 4
MQTZCT (Matte) Mangan, Armitage & Adams (2008) 140 91 49 140 91 49 1
Utah-RCMP/CPC (Utah) Honts, Hodes & Raskin (1985) 38 19 19 38 19 19 1
Utah-RCMP/CPC (Utah) Driscoll, Honts & Jones (1987) 40 20 20 40 20 20 1
Utah-RCMP/CPC (Utah) Honts (1996) 32 21 11 32 21 11 1
Utah-DLC (Utah) Honts & Raskin (1988) 25 12 13 25 12 13 1
Utah-DLC (Utah) Horowitz, Kircher, Honts & Raskin (1997) 30 15 15 30 15 15 1
Utah-DLC (Utah) Kircher & Raskin (1988) 100 50 50 200 100 100 2
Utah-DLC (Utah) Honts, Raskin & Kircher (1987) 20 10 10 20 10 10 1
ZCT (ESS) Nelson, Krapohl & Handler (2008) 7 - - - 700 350 350 7
ZCT (ESS) Nelson, Blalock, Oelrich & Cushman (2011) 7 - - - 250 150 100 25
ZCT (ESS) Nelson & Krapohl (2011) 8 G - - - 60 30 30 6
ZCT (ESS) Nelson et al (2011) 572 304 268 1,382 741 641 74
ZCT (ESS) Blalock, Cushman & Nelson (2009) 7 - - - 900 450 450 9
ZCT (ESS) Handler, Nelson, Goodson & Hicks (2010) 7 - - - 1,900 950 950 19
1-8 Sample scores based on the same sample cases.
A-G Sample scores published in the same study.

269 Polygraph, 2011, 40(4)


Validated Techniques

Appendix B

Criterion Accuracy of Included Studies

PDD Technique Study Sens. Spec. FN FP D-INC T-INC Unweighted Accuracy Unweighted INC
AFMGQT (7-position) Senter, Waller & Krapohl (2008) 1 .758 .917 .212 .083 .030 .001 .849 .015
AFMGQT (7-position) Nelson, Handler, Morgan & O'Burke (In press) 2 .818 .364 .001 .333 .182 .303 .761 .242
AFMGQT (7-position) Nelson, Handler, & Senter (In press) 3 A .780 .420 .040 .200 .140 .420 .814 .280
AFMGQT (ESS) Nelson, Blalock & Handler (2011) 2 .831 .616 .010 .175 .158 .208 .883 .183
AFMGQT (ESS) Nelson & Blalock (In press) 1 .511 .862 .211 .027 .277 .028 .839 .152
AFMGQT (ESS) Nelson, Handler, & Senter (In press) 3 A .806 .639 .067 .131 .127 .229 .876 .178
Backster You-Phase (Backster) Nelson, Handler, Adams & Backster (In press) 4 .943 .543 .009 .274 .048 .183 .828 .116
Backster You-Phase (Backster) Nelson (In press) .668 .592 .019 .079 .313 .329 .927 .321
CIT MacLaren 2001 .815 .832 .185 .168 .001 .001 .823 .000
DLST/TES (7-position) Research Division Staff 1995a .654 .676 .154 .206 .192 .118 .788 .155
DLST/TES (7-position) Research Division Staff 1995b .833 .909 .167 .073 .000 .018 .880 .009
DLST/TES (7-position) Nelson (In press) B .910 .677 .037 .184 .053 .139 .874 .096
DLST/TES (7-position) Nelson, Handler, Blalock & Hernández (In press) 5 C .583 .940 .271 .020 .145 .039 .831 .092
DLST/TES (ESS) Nelson & Handler (In press) .917 .587 .036 .253 .047 .160 .831 .104
DLST/TES (ESS) Nelson, Handler & Morgan (In press) .625 .950 .210 .040 .165 .010 .854 .088
DLST/TES (ESS) Nelson (In press) B .935 .730 .046 .195 .020 .075 .871 .048
DLST/TES (ESS) Nelson, Handler, Blalock & Hernández (In press) 5 C .665 .839 .207 .040 .126 .119 .859 .123
Federal You-Phase (7-position) Nelson (2011) D .833 .417 .010 .138 .157 .444 .870 .301
Federal You-Phase (7-position) Nelson, Handler, Blalock & Cushman (In press) 6 E .844 .730 .036 .171 .119 .097 .885 .108
Federal You-Phase (ESS) Nelson (2011) D .813 .729 .050 .126 .090 .102 .897 .096
Federal You-Phase (ESS) Nelson, Handler, Blalock & Cushman (In press) 6 E .859 .770 .027 .143 .145 .325 .906 .235
Federal ZCT (7-position) Blackwell (1998) .923 .448 .015 .295 .062 .257 .793 .159
Federal ZCT (7-position) Krapohl & Cushman (2006) 7 F .824 .560 .044 .180 .132 .260 .853 .196
Federal ZCT (7-position) Honts, Amato & Gordon (2004) as reported in Honts in Granhag (2004) .917 .917 .001 .083 .083 .000 .958 .042
Federal ZCT (7-pos. evidentiary) Krapohl & Cushman (2006) 7 F .792 .824 .122 .116 .086 .060 .872 .073
Federal ZCT (7-pos. evidentiary) Nelson & Krapohl (2011) 8 G .933 .667 .000 .233 .067 .133 .870 .100
IZCT (Horizontal) Shurani & Chavez (2010) .955 .900 .023 .001 .023 .100 .988 .061
IZCT (Horizontal) Gordon, Mohamed, Faro, Platek, Ahmad & Williams (2005) .999 .800 .001 .001 .001 .200 .999 .100
IZCT (Horizontal) Shurani (2011) .999 .999 .001 .001 .001 .001 .999 .000
MQTZCT (Matte) Matte & Reuss (1989) dissertation .969 .914 .000 .001 .031 .086 .999 .059
MQTZCT (Matte) Shurani, Stein & Brand (2009) .929 1.000 .071 .001 .000 .001 .964 .000
MQTZCT (Matte) Mangan, Armitage & Adams (2008) .978 1.000 .001 .001 .022 .001 .999 .011
Utah-RCMP/CPC (Utah) Honts, Hodes & Raskin (1985) .895 .421 .001 .211 .105 .368 .833 .237
Utah-RCMP/CPC (Utah) Driscoll, Honts & Jones (1987) .900 .900 .001 .001 .100 .100 .999 .100
Utah-RCMP/CPC (Utah) Honts (1996) .714 .818 .048 .001 .238 .182 .969 .210
Utah-DLC (Utah) Honts & Raskin (1988) .917 .846 .083 .001 .001 .154 .958 .077
Utah-DLC (Utah) Horowitz, Kircher, Honts & Raskin (1997) .733 .867 .133 .133 .133 .001 .856 .067
Utah-DLC (Utah) Kircher & Raskin (1988) .880 .860 .060 .060 .060 .080 .935 .070
Utah-DLC (Utah) Honts, Raskin & Kircher (1987) .800 .700 .001 .200 .200 .100 .889 .150
ZCT (ESS) Nelson, Krapohl & Handler (2008) 7 .749 .814 .154 .077 .097 .109 .872 .103
ZCT (ESS) Nelson, Blalock, Oelrich & Cushman (2011) 7 .793 .930 .073 .001 .133 .070 .958 .102
ZCT (ESS) Nelson & Krapohl (2011) 8 G .833 .633 .001 .133 .167 .233 .913 .200
ZCT (ESS) Nelson et al (2011) .863 .789 .047 .093 .103 .107 .921 .105
ZCT (ESS) Blalock, Cushman & Nelson (2009) 7 .773 .727 .122 .102 .104 .171 .870 .138
ZCT (ESS) Handler, Nelson, Goodson & Hicks (2010) 7 .865 .881 .103 .089 .040 .039 .901 .040
1-8 Sample scores based on the same sample cases.
A-G Sample scores published in the same study.
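The derived columns in Appendix B are arithmetically linked: within each criterion group the decision rates sum to 1 (sensitivity + FN + D-INC = 1 for deceptive cases; specificity + FP + T-INC = 1 for truthful cases), and the unweighted accuracy, inconclusive, and predictive-value figures follow from the four base rates once inconclusive results are excluded. A minimal Python sketch of that reading (the function name and dictionary layout are illustrative, not from the source), checked against the Senter, Waller & Krapohl (2008) row:

```python
# Sketch of how the criterion-accuracy columns appear to relate.
# Assumes sens + FN + D-INC = 1 (deceptive cases) and
# spec + FP + T-INC = 1 (truthful cases).

def accuracy_profile(sens, spec, fn, fp):
    """Derive the remaining profile statistics from the four base rates."""
    d_correct = sens / (sens + fn)   # correct among deceptive, INC excluded
    t_correct = spec / (spec + fp)   # correct among truthful, INC excluded
    d_inc = 1 - sens - fn            # inconclusive rate, deceptive cases
    t_inc = 1 - spec - fp            # inconclusive rate, truthful cases
    return {
        "unweighted_accuracy": (d_correct + t_correct) / 2,
        "unweighted_inconclusives": (d_inc + t_inc) / 2,
        "PPV": sens / (sens + fp),   # positive predictive value
        "NPV": spec / (spec + fn),   # negative predictive value
        "D_correct": d_correct,
        "T_correct": t_correct,
    }

# Senter, Waller & Krapohl (2008) row from Appendix B:
profile = accuracy_profile(sens=.758, spec=.917, fn=.212, fp=.083)
print(round(profile["unweighted_accuracy"], 3))  # 0.849, matching the table
```

The same arithmetic reproduces the PPV, NPV, D Correct, and T Correct rows of the Appendix E tables.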



Ad Hoc Committee on Validated Techniques

Appendix C
Reliability Statistics for Included Studies

PDD Technique Study Fleiss' Kappa Decision Agreement Correlation
AFMGQT (7-position) Senter, Waller & Krapohl (2008) 1 .750 .930 .940
AFMGQT (7-position) Nelson, Handler, Morgan & O'Burke (In press) 2 - 1.000 -
AFMGQT (7-position) Nelson, Handler, & Senter (In press) 3 A - - -
AFMGQT (ESS) Nelson, Blalock & Handler (2011) 2 - 1.000 .931
AFMGQT (ESS) Nelson & Blalock (In press) 1 - - -
AFMGQT (ESS) Nelson, Handler, & Senter (In press) 3 A - - -
Backster You-Phase (Backster) Nelson, Handler, Adams & Backster (In press) 4 - - .567
Backster You-Phase (Backster) Nelson (In press) - - -
CIT MacLaren 2001 - - -
DLST/TES (7-position) Research Division Staff 1995a .760 .890 -
DLST/TES (7-position) Research Division Staff 1995b - - -
DLST/TES (7-position) Nelson (In press) B - - -
DLST/TES (7-position) Nelson, Handler, Blalock & Hernández (In press) 5 C - .722 -
DLST/TES (ESS) Nelson & Handler (In press) - - -
DLST/TES (ESS) Nelson, Handler & Morgan (In press) - .911 -
DLST/TES (ESS) Nelson (In press) B - - -
DLST/TES (ESS) Nelson, Handler, Blalock & Hernández (In press) 5 C - .769 -
Federal You-Phase (7-position) Nelson (2011) D - - -
Federal You-Phase (7-position) Nelson, Handler, Blalock & Cushman (In press) 6 E - .852 -
Federal You-Phase (ESS) Nelson (2011) D - - -
Federal You-Phase (ESS) Nelson, Handler, Blalock & Cushman (In press) 6 E - .897 -
Federal ZCT (7-position) Blackwell (1998) .570 .800 -
Federal ZCT (7-position) Krapohl & Cushman (2006) 7 F - - -
Federal ZCT (7-position) Honts, Amato & Gordon (2004) as reported in Honts in Granhag (2004) - - -
Federal ZCT (7-position evidentiary) Krapohl & Cushman (2006) 7 F - .870 -
Federal ZCT (7-position evidentiary) Nelson & Krapohl (2011) 8 G - - -
IZCT (Horizontal) Shurani & Chavez (2010) - - -
IZCT (Horizontal) Gordon, Mohamed, Faro, Platek, Ahmad & Williams (2005) - - -
IZCT (Horizontal) Shurani (2011) - - -
MQTZCT (Matte) Matte & Reuss (1989) dissertation - - .990
MQTZCT (Matte) Shurani, Stein & Brand (2009) - - -
MQTZCT (Matte) Mangan, Armitage & Adams (2008) - - -
Utah-RCMP/CPC (Utah) Honts, Hodes & Raskin (1985) .480 .950 .880
Utah-RCMP/CPC (Utah) Driscoll, Honts & Jones (1987) - - .860
Utah-RCMP/CPC (Utah) Honts (1996) - .930 .910
Utah-DLC (Utah) Honts & Raskin (1988) - - .940
Utah-DLC (Utah) Horowitz, Kircher, Honts & Raskin (1997) - - .920
Utah-DLC (Utah) Kircher & Raskin (1988) .730 .990 .970
Utah-DLC (Utah) Honts, Raskin & Kircher (1987) .730 .960 -
ZCT (ESS) Nelson, Krapohl & Handler (2008) 7 .610 - -
ZCT (ESS) Nelson, Blalock, Oelrich & Cushman (2011) 7 - .950 -
ZCT (ESS) Nelson & Krapohl (2011) 8 G - - -
ZCT (ESS) Nelson et al (2011) - - -
ZCT (ESS) Blalock, Cushman & Nelson (2009) 7 .560 - -
ZCT (ESS) Handler, Nelson, Goodson & Hicks (2010) 7 .590 - .840
1-8 Sample scores based on the same sample cases.
A-G Sample scores published in the same study.
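Raw decision agreement overstates scorer consistency because some agreement occurs by chance; Fleiss' kappa, the first reliability column above, corrects for this across any number of scorers. A minimal illustrative implementation of the standard textbook computation (this is not the committee's own code, and the 'DI'/'NDI' category labels are hypothetical):

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for `ratings`: a list of per-case rating lists,
    each of length k (the number of scorers). Categories may be any
    hashable labels, e.g. 'DI', 'NDI', 'INC'."""
    k = len(ratings[0])          # scorers per case
    n = len(ratings)             # number of cases
    totals = Counter()           # marginal counts per category
    p_bar = 0.0                  # mean observed pairwise agreement
    for case in ratings:
        counts = Counter(case)
        totals.update(counts)
        p_bar += (sum(c * c for c in counts.values()) - k) / (k * (k - 1))
    p_bar /= n
    # Chance agreement from the marginal category proportions
    p_e = sum((v / (n * k)) ** 2 for v in totals.values())
    return (p_bar - p_e) / (1 - p_e)

# Three cases scored by three scorers; two unanimous, one split 2-1:
print(round(fleiss_kappa([["DI"] * 3, ["NDI"] * 3, ["DI", "DI", "NDI"]]), 2))  # 0.55
```

Note how the 2-1 split drives kappa (0.55) well below the raw pairwise agreement of 7/9.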


Appendix D
Means and Standard Deviations of Criterion Deceptive and Criterion Truthful Scores

PDD Technique Study Mean D StDev D Mean T StDev T


AFMGQT (7-position) Senter, Waller & Krapohl (2008) 1 - - - -
AFMGQT (7-position) Nelson, Handler, Morgan & O'Burke (In press) 2 -2.995* 4.727* 2.365* 3.879*
AFMGQT (7-position) Nelson, Handler, & Senter (In press) 3 A -2.827* 4.504* 3.556* 3.766*
AFMGQT (ESS) Nelson, Blalock & Handler (2011) 2 -3.850* 4.730* 4.530* 5.180*
AFMGQT (ESS) Nelson & Blalock (In press) 1 -2.000* 5.030* 3.420* 3.470*
AFMGQT (ESS) Nelson, Handler, & Senter (In press) 3 A -3.031* 4.535* 3.265* 3.661*
Backster You-Phase (Backster) Nelson, Handler, Adams & Backster (In press) 4 -19.649 6.482 3.612 10.010
Backster You-Phase (Backster) Nelson (In press) -12.460 8.353 6.820 10.572
CIT MacLaren 2001 - - - -
DLST/TES (7-position) Research Division Staff 1995a - - - -
DLST/TES (7-position) Research Division Staff 1995b - - - -
DLST/TES (7-position) Nelson (In press) B -2.418* 3.818* 2.653* 3.618*
DLST/TES (7-position) Nelson, Handler, Blalock & Hernández (In press) 5 C -1.833* 4.099* 3.670* 3.443*
DLST/TES (ESS) Nelson & Handler (In press) -2.442* 3.531* 2.086* 3.460*
DLST/TES (ESS) Nelson, Handler & Morgan (In press) -1.271* 3.131* 4.660* 2.299*
DLST/TES (ESS) Nelson (In press) B -3.031* 5.104* 3.265* 3.935*
DLST/TES (ESS) Nelson, Handler, Blalock & Hernández (In press) 5 C -1.781* 3.437* 3.636* 2.917*
Federal You-Phase (7-position) Nelson (2011) D -6.398 4.914 5.485 5.106
Federal You-Phase (7-position) Nelson, Handler, Blalock & Cushman (In press) 6 E -7.991 6.733 6.514 6.680
Federal You-Phase (ESS) Nelson (2011) D -6.685 6.881 6.735 6.045
Federal You-Phase (ESS) Nelson, Handler, Blalock & Cushman (In press) 6 E -8.606 5.842 6.018 7.107
Federal ZCT (7-position) Blackwell (1998) -10.385 9.510 6.981 7.495
Federal ZCT (7-position) Krapohl & Cushman (2006) 7 F -6.264 10.863 9.776 8.212
Federal ZCT (7-position) Honts, Amato & Gordon (2004) as reported in Honts in Granhag (2004) -8.420 6.837 6.640 9.187
Federal ZCT (7-position evidentiary) Krapohl & Cushman (2006) 7 F -6.264 10.863 9.776 8.212
Federal ZCT (7-position evidentiary) Nelson & Krapohl (2011) 8 G -9.600 7.356 6.926 10.709
IZCT (Horizontal) Shurani & Chavez (2010) -8.847 15.264 21.181 5.097
IZCT (Horizontal) Gordon, Mohamed, Faro, Platek, Ahmad & Williams (2005) -36.000 12.946 8.750 1.635
IZCT (Horizontal) Shurani (2011) -19.667 9.607 28.948 5.963
MQTZCT (Matte) Matte & Reuss (1989) dissertation -9.148+ 2.843+ 3.099+ 6.002+
MQTZCT (Matte) Shurani, Stein & Brand (2009) -6.949+ 1.630+ 5.388+ 1.246+
MQTZCT (Matte) Mangan, Armitage & Adams (2008) -10.037+ 2.995+ 7.190+ 3.189+
Utah-RCMP/CPC (Utah) Honts, Hodes & Raskin (1985) -11.950 6.520 9.000 10.660
Utah-RCMP/CPC (Utah) Driscoll, Honts & Jones (1987) -10.700 6.000 10.350 6.470
Utah-RCMP/CPC (Utah) Honts (1996) -15.000 5.564 8.170 5.270
Utah-DLC (Utah) Honts & Raskin (1988) -11.500 5.803 9.000 5.803
Utah-DLC (Utah) Horowitz, Kircher, Honts & Raskin (1997) -7.000 13.500 8.500 11.500
Utah-DLC (Utah) Kircher & Raskin (1988) -7.710 8.420 10.785 8.671
Utah-DLC (Utah) Honts, Raskin & Kircher (1987) -14.000 7.490 9.600 7.490
ZCT (ESS) Nelson, Krapohl & Handler (2008) 7 -9.606 9.743 9.162 8.564
ZCT (ESS) Nelson, Blalock, Oelrich & Cushman (2011) 7 -10.740 8.263 8.690 4.585
ZCT (ESS) Nelson & Krapohl (2011) 8 G -11.833 7.764 6.000 9.592
ZCT (ESS) Nelson et al (2011) -11.354 9.392 7.373 9.270
ZCT (ESS) Blalock, Cushman & Nelson (2009) 7 -11.253 9.786 7.191 8.785
ZCT (ESS) Handler, Nelson, Goodson & Hicks (2010) 7 -7.953 10.017 11.212 8.619
1-8 Sample scores based on the same sample cases.
A-G Sample scores published in the same study.
* Means and standard deviations are reported as subtotal scores for individual questions.
+ Means and standard deviations are reported for subtotal scores for individual test charts.
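One use of these means and standard deviations is to gauge how far apart the deceptive and truthful score distributions sit, for example as a standardized mean difference (Cohen's d with a simple unweighted pooled standard deviation) — a derived quantity this appendix does not itself report, sketched here for illustration:

```python
import math

def standardized_difference(mean_d, sd_d, mean_t, sd_t):
    """Cohen's d between truthful and deceptive score distributions,
    using a simple (unweighted) pooled standard deviation."""
    pooled_sd = math.sqrt((sd_d ** 2 + sd_t ** 2) / 2)
    return (mean_t - mean_d) / pooled_sd

# Federal You-Phase / ESS, Nelson (2011) row of Appendix D:
d = standardized_difference(-6.685, 6.881, 6.735, 6.045)
print(round(d, 2))  # roughly two pooled standard deviations of separation
```

Larger values indicate less overlap between the two distributions, and hence easier discrimination at any fixed cut score.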


Appendix E-1
AFMGQT / Seven-position TDA

Study Senter, Waller & Krapohl (2008); Nelson, Handler, Morgan & O'Burke (In press); Nelson, Handler & Senter (In press)
Sample N 69 22 100
N Deceptive 33 11 50
N Truthful 36 11 50
Scorers 12 3 1
D Scores 33 33 50
T Scores 36 33 50
Total Scores 69 66 100
Mean D -2.000 -2.995 -2.827
StDev D 5.030 4.727 4.504
Mean T 3.420 2.365 3.556
StDev T 3.470 3.879 3.766
Reliability Kappa .750 - -
Reliability Agreement .930 .999 -
Reliability Correlation .940 - -
Unweighted Average Accuracy .849 .761 .814
Unweighted Inconclusives .015 .242 .280
Sensitivity .758 .818 .780
Specificity .917 .364 .420
FN Errors .212 .000 .040
FP Errors .083 .333 .200
D-INC .030 .182 .140
T-INC .000 .303 .420
PPV .901 .711 .796
NPV .812 .999 .913
D Correct .781 .999 .951
T Correct .917 .522 .677


Appendix E-2
AFMGQT / ESS

Study Nelson, Blalock & Handler (2011); Nelson & Blalock (In press); Nelson, Handler & Senter (In press)
Sample N 22 69 100
N Deceptive 11 33 50
N Truthful 11 36 50
Scorers 3 1 1
D Scores 33 33 50
T Scores 33 36 50
Total Scores 66 69 100
Mean D -3.850 -2.000 -3.031
StDev D 4.730 5.030 4.535
Mean T 4.530 3.420 3.265
StDev T 5.180 3.470 3.661
Reliability Kappa - - -
Reliability Agreement .999 - -
Reliability Correlation .931 - -
Unweighted Average Accuracy .883 .839 .876
Unweighted Inconclusives .183 .152 .178
Sensitivity .831 .511 .806
Specificity .616 .862 .639
FN Errors .010 .211 .067
FP Errors .175 .027 .131
D-INC .158 .277 .127
T-INC .208 .028 .229
PPV .826 .951 .860
NPV .984 .803 .905
D Correct .988 .708 .923
T Correct .779 .970 .830


Appendix E-3
Backster You-Phase

Study Nelson, Handler, Adams & Backster (In press); Nelson (In press)
Sample N 22 100
N Deceptive 11 50
N Truthful 11 50
Scorers 7 1
D Scores 77 50
T Scores 77 50
Total Scores 154 100
Mean D -19.649 -12.460
StDev D 6.482 8.353
Mean T 3.612 6.820
StDev T 10.010 10.572
Reliability Kappa - -
Reliability Agreement - -
Reliability Correlation .567 -
Unweighted Average Accuracy .825 .927
Unweighted Inconclusives .117 .321
Sensitivity .948 .668
Specificity .532 .592
FN Errors .001 .019
FP Errors .286 .079
D-INC .052 .313
T-INC .182 .329
PPV .768 .894
NPV .999 .969
D Correct .999 .972
T Correct .650 .882


Appendix E-4
Concealed Information Test / Guilty Knowledge Test

as reported by MacLaren (2001)

MacLaren, V. V. (2001). A quantitative review of the guilty knowledge test. Journal of Applied
Psychology, 86, 674-683.

Results are reported with all informed participants, and with only those informed participants who
also engaged in the behavioral acts.

Mean (St. Er.) {95% CI}

Informed/guilty and uninformed participants; All informed and uninformed participants
Number of studies 39 50
N Deceptive 666 843
N Truthful 404 404
Total N 1070 1243
Unweighted Accuracy .823 (.011) {.801 to .846}; .795 (.043) {.711 to .880}
Unweighted Inconclusives - -
Sensitivity .815 (.014) {.789 to .842}; .759 (.053) {.655 to .864}
Specificity .832 (.019) {.795 to .868}; .832 (.068) {.698 to .965}
FN Errors .185 (.014) {.158 to .211}; .241 (.053) {.136 to .345}
FP Errors .168 (.019) {.132 to .205}; .168 (.068) {.035 to .302}
D-INC - -
T-INC - -
PPV .889 (.011) {.868 to .909}; .904 (.041) {.824 to .984}
NPV .732 (.021) {.690 to .774}; .623 (.075) {.477 to .770}
D Correct .815 (.014) {.789 to .842}; .759 (.053) {.655 to .864}
T Correct .832 (.019) {.795 to .868}; .832 (.068) {.698 to .965}
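The MacLaren intervals appear consistent with the usual normal approximation, mean ± 1.96 × standard error. A quick check against the reported unweighted accuracy for informed/guilty participants (small differences from the published bounds are expected, since the underlying means were rounded before publication):

```python
def normal_ci(mean, se, z=1.96):
    """95% confidence interval under a normal approximation."""
    return (mean - z * se, mean + z * se)

# Unweighted accuracy, informed/guilty participants: .823 (.011)
lo, hi = normal_ci(.823, .011)
print(round(lo, 3), round(hi, 3))  # close to the reported {.801 to .846}
```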


Appendix E-5
Directed Lie Screening Test (TES) / Seven-position TDA

Study Research Division Staff (1995a); Research Division Staff (1995b); Nelson (In press); Nelson, Handler, Blalock & Hernández (In press)
Sample N 94 85 100 49
N Deceptive 26 30 50 25
N Truthful 68 55 50 24
Scorers 3 10 1 2
D Scores 26 30 50 50
T Scores 68 55 50 48
Total Scores 94 85 100 98
Mean D - - -2.418 -1.833
StDev D - - 3.818 4.099
Mean T - - 2.653 3.670
StDev T - - 3.618 3.443
Reliability Kappa .760 - - -
Reliability Agreement .890 - - .722
Reliability Correlation - - - -
Unweighted Average Accuracy .788 .880 .874 .831
Unweighted Inconclusives .155 .009 .096 .092
Sensitivity .654 .833 .910 .583
Specificity .676 .909 .677 .940
FN Errors .154 .167 .037 .271
FP Errors .206 .073 .184 .020
D-INC .192 .001 .053 .145
T-INC .118 .018 .139 .039
PPV .761 .920 .832 .967
NPV .815 .845 .948 .776
D Correct .810 .833 .961 .683
T Correct .767 .926 .786 .979


Appendix E-6
Directed Lie Screening Test (TES) / ESS

Study Nelson & Handler (In press); Nelson, Handler & Morgan (In press); Nelson (In press); Nelson, Handler, Blalock & Hernández (In press)
Sample N 100 49 100 49
N Deceptive 50 24 50 25
N Truthful 50 25 50 24
Scorers 1 1 1 2
D Scores 50 24 50 50
T Scores 50 25 50 48
Total Scores 100 49 100 98
Mean D -2.442 -1.271 -3.031 -1.781
StDev D 3.531 3.131 5.104 3.437
Mean T 2.086 4.660 3.265 3.636
StDev T 3.460 2.299 3.935 2.917
Reliability Kappa - - - -
Reliability Agreement - .911 - .769
Reliability Correlation - - - -
Unweighted Average Accuracy .831 .854 .871 .859
Unweighted Inconclusives .104 .088 .048 .123
Sensitivity .917 .625 .935 .665
Specificity .587 .950 .730 .839
FN Errors .036 .210 .046 .207
FP Errors .253 .040 .195 .040
D-INC .047 .165 .020 .126
T-INC .160 .010 .075 .119
PPV .784 .940 .827 .943
NPV .942 .819 .941 .802
D Correct .962 .749 .953 .763
T Correct .699 .960 .789 .954


Appendix E-7
Federal You-Phase / Seven-position TDA

Study Nelson (In press); Nelson, Handler, Blalock & Cushman (In press)
Sample N 100 22
N Deceptive 50 11
N Truthful 50 11
Scorers 1 10
D Scores 50 110
T Scores 50 110
Total Scores 100 220
Mean D -6.398 -7.991
StDev D 4.914 6.733
Mean T 5.485 6.514
StDev T 5.106 6.680
Reliability Kappa - -
Reliability Agreement - .852
Reliability Correlation - -
Unweighted Average Accuracy .870 .885
Unweighted Inconclusives .301 .108
Sensitivity .833 .844
Specificity .417 .730
FN Errors .010 .036
FP Errors .138 .171
D-INC .157 .119
T-INC .444 .097
PPV .858 .832
NPV .977 .953
D Correct .988 .959
T Correct .751 .810


Appendix E-8
Federal You-Phase / ESS

Study Nelson (In press); Nelson, Handler, Blalock & Cushman (In press)
Sample N 100 22
N Deceptive 50 11
N Truthful 50 11
Scorers 1 10
D Scores 50 110
T Scores 50 110
Total Scores 100 220
Mean D -6.685 -8.606
StDev D 6.881 5.842
Mean T 6.735 6.018
StDev T 6.045 7.107
Reliability Kappa - -
Reliability Agreement - .900
Reliability Correlation - -
Unweighted Average Accuracy .897 .906
Unweighted Inconclusives .096 .235
Sensitivity .813 .859
Specificity .729 .770
FN Errors .050 .027
FP Errors .126 .143
D-INC .090 .145
T-INC .102 .325
PPV .866 .857
NPV .936 .966
D Correct .942 .970
T Correct .853 .843


Appendix E-9
Federal ZCT / Seven-position TDA

Study Blackwell (1998); Krapohl & Cushman (2006); Honts, Amato & Gordon (2004) as reported in Granhag (2004)
Sample N 100 100 48
N Deceptive 65 50 24
N Truthful 35 50 24
Scorers 3 10 3
D Scores 195 500 72
T Scores 105 500 72
Total Scores 300 1,000 144
Mean D -10.385 -6.264 -8.420
StDev D 9.510 10.863 6.837
Mean T 6.981 9.776 6.640
StDev T 7.495 8.212 9.187
Reliability Kappa .570 - -
Reliability Agreement .800 - -
Reliability Correlation - - -
Unweighted Average Accuracy .793 .852 .958
Unweighted Inconclusives .159 .198 .042
Sensitivity .923 .824 .917
Specificity .448 .556 .917
FN Errors .015 .044 .000
FP Errors .295 .180 .083
D-INC .062 .132 .083
T-INC .257 .264 .001
PPV .758 .821 .917
NPV .967 .927 .999
D Correct .984 .949 .999
T Correct .603 .755 .917


Appendix E-10
Federal ZCT / Seven-position TDA with Evidentiary Rules

Study Krapohl & Cushman (2006); Nelson & Krapohl (2011)
Sample N 100 60
N Deceptive 50 30
N Truthful 50 30
Scorers 10 6
D Scores 500 30
T Scores 500 30
Total Scores 1,000 60
Mean D -6.264 -9.600
StDev D 10.863 7.356
Mean T 9.776 6.926
StDev T 8.212 10.709
Reliability Kappa - -
Reliability Agreement .870 -
Reliability Correlation - -
Unweighted Average Accuracy .872 .870
Unweighted Inconclusives .073 .100
Sensitivity .792 .933
Specificity .824 .667
FN Errors .122 .001
FP Errors .116 .233
D-INC .086 .067
T-INC .060 .133
PPV .872 .800
NPV .871 .999
D Correct .999 .999
T Correct .880 .741


Appendix E-11
Integrated Zone Comparison Technique / Horizontal Scoring System

Study Gordon, Mohamed, Faro, Platek, Ahmad & Williams (2005); Shurani & Chavez (2010); Shurani (2011)
Sample N 11 84 84
N Deceptive 6 44 36
N Truthful 5 40 48
Scorers 1 4 1
D Scores 6 44 36
T Scores 5 40 48
Total Scores 11 84 84
Mean D -36.000 -8.847 -19.667
StDev D 12.946 15.264 9.607
Mean T 8.750 21.181 28.948
StDev T 1.635 5.097 5.963
Reliability Kappa - - -
Reliability Agreement - - -
Reliability Correlation - - -
Unweighted Average Accuracy .999 .988 .999
Unweighted Inconclusives .100 .061 .001
Sensitivity .999 .955 .999
Specificity .800 .900 .999
FN Errors .001 .023 .001
FP Errors .001 .001 .001
D-INC .001 .023 .001
T-INC .200 .100 .001
PPV .999 .999 .999
NPV .999 .975 .999
D Correct .999 .977 .999
T Correct .999 .999 .999


Appendix E-12
Matte Quadri-Track Zone Comparison Technique

Study Matte & Reuss (1989); Shurani, Stein & Brand (2009); Mangan, Armitage & Adams (2008)
Sample N 122 57 140
N Deceptive 64 28 91
N Truthful 58 29 49
Scorers 2 4 1
D Scores 64 28 91
T Scores 58 29 49
Total Scores 122 57 140
Mean D -9.148 -6.949* -10.037*
StDev D 2.843 1.630* 2.995*
Mean T 3.099 5.388* 7.190*
StDev T 6.002 1.246* 3.189*
Reliability Kappa - - -
Reliability Agreement - - -
Reliability Correlation .990 - -
Unweighted Average Accuracy .999 .964 .999
Unweighted Inconclusives .059 .001 .011
Sensitivity .969 .929 .978
Specificity .914 .999 .999
FN Errors .001 .071 .001
FP Errors .001 .001 .001
D-INC .031 .001 .022
T-INC .086 .001 .001
PPV .999 .999 .999
NPV .999 .933 .999
D Correct .999 .929 .999
T Correct .999 .999 .999


Appendix E-13
Utah PLC / Utah Numerical Scoring

Study Kircher & Raskin (1988); Honts, Raskin & Kircher (1987)
Sample N 100 20
N Deceptive 50 10
N Truthful 50 10
Scorers 1 1
D Scores 50 10
T Scores 50 10
Total Scores 100 20
Mean D -7.710 -14.000
StDev D 8.420 7.490
Mean T 10.785 9.600
StDev T 8.671 7.490
Reliability Kappa .730 .730
Reliability Agreement .990 .960
Reliability Correlation .970 -
Unweighted Average Accuracy .935 .889
Unweighted Inconclusives .070 .150
Sensitivity .880 .800
Specificity .860 .700
FN Errors .060 .000
FP Errors .060 .200
D-INC .060 .200
T-INC .080 .100
PPV .936 .800
NPV .935 .999
D Correct .936 .999
T Correct .935 .778


Appendix E-14
Utah DLC / Utah Numerical Scoring

Study Honts & Raskin (1988); Horowitz, Kircher, Honts & Raskin (1997)
Sample N 25 30
N Deceptive 12 15
N Truthful 13 15
Scorers 1 1
D Scores 12 15
T Scores 13 15
Total Scores 25 30
Mean D -11.500 -7.000
StDev D 5.803 13.500
Mean T 9.000 8.500
StDev T 5.803 11.500
Reliability Kappa - -
Reliability Agreement - -
Reliability Correlation .940 .920
Unweighted Average Accuracy .958 .856
Unweighted Inconclusives .077 .067
Sensitivity .917 .733
Specificity .846 .867
FN Errors .083 .133
FP Errors .001 .133
D-INC .001 .133
T-INC .154 .001
PPV .999 .846
NPV .910 .867
D Correct .917 .846
T Correct .999 .867


Appendix E-15
Utah RCMP Zone / Utah Numerical Scoring

Study Honts, Hodes & Raskin (1985); Honts (1996); Driscoll, Honts & Jones (1987)
Sample N 38 32 40
N Deceptive 19 21 20
N Truthful 19 11 20
Scorers 1 1 1
D Scores 19 21 20
T Scores 19 11 20
Total Scores 38 32 40
Mean D -11.950 -15.000 -10.700
StDev D 6.520 5.564 6.000
Mean T 9.000 8.170 10.350
StDev T 10.660 5.270 6.470
Reliability Kappa .480 - -
Reliability Agreement .950 .930 -
Reliability Correlation .880 .910 .860
Unweighted Average Accuracy .833 .969 .999
Unweighted Inconclusives .237 .210 .100
Sensitivity .895 .714 .900
Specificity .421 .818 .900
FN Errors .001 .048 .001
FP Errors .211 .001 .001
D-INC .105 .238 .100
T-INC .368 .182 .100
PPV .810 .999 .999
NPV .999 .945 .999
D Correct .999 .938 .999
T Correct .667 .999 .999


Appendix E-16
Utah PLC DLC Combined / Utah Numerical Scoring

Technique Utah PLC Utah DLC RCMP


Sample N 120 55 110
N Deceptive 60 27 60
N Truthful 60 28 50
Scorers 2 2 3
D Scores 60 27 60
T Scores 60 28 50
Total Scores 120 55 110
Mean D -10.855 -9.250 -12.550
StDev D 7.955 9.652 6.028
Mean T 10.193 8.750 9.173
StDev T 8.081 8.652 7.467
Reliability Kappa .730 - .480
Reliability Agreement .975 - .940
Reliability Correlation .970 .930 .883
Unweighted Average Accuracy .927 .902 .939
Unweighted Inconclusives .083 .073 .185
Sensitivity .867 .815 .833
Specificity .833 .857 .700
FN Errors .050 .111 .017
FP Errors .083 .071 .080
D-INC .083 .074 .150
T-INC .083 .071 .220
PPV .912 .919 .912
NPV .943 .885 .977
D Correct .945 .880 .980
T Correct .909 .923 .897


Appendix E-17
Zone Comparison Techniques / ESS

Study Nelson, Krapohl & Handler (2008); Nelson, Blalock, Oelrich & Cushman (2011); Nelson & Krapohl (2011); Nelson et al. (2011); Blalock, Cushman & Nelson (2009); Handler, Nelson, Goodson & Hicks (2010)
Sample N 100 10 60 562 100 100
N Deceptive 50 6 30 298 50 50
N Truthful 50 4 30 264 50 50
Scorers 7 25 6 74 9 19
D Scores 350 150 30 741 450 950
T Scores 350 100 30 641 450 950
Total Scores 700 250 60 1,382 900 1,900
Mean D -9.634 -10.740 -11.833 -11.354 -11.253 -7.953
StDev D 8.475 8.263 7.764 9.392 9.786 10.017
Mean T 8.849 8.690 6.000 7.373 7.191 11.212
StDev T 7.457 4.585 9.592 9.270 8.785 8.619
Reliability Kappa .610 - - - .560 .590
Reliability Agreement - .950 - - - -
Reliability Correlation - - - - - .840
Unweighted Average Accuracy .872 .958 .913 .883 .870 .867
Unweighted Inconclusives .103 .102 .200 .114 .138 .115
Sensitivity .749 .793 .833 .804 .773 .666
Specificity .814 .930 .633 .761 .727 .873
FN Errors .154 .073 .001 .090 .122 .196
FP Errors .077 .001 .133 .117 .102 .035
D-INC .097 .133 .167 .105 .104 .138
T-INC .109 .070 .233 .122 .171 .092
PPV .907 .999 .862 .873 .883 .950
NPV .841 .927 .999 .894 .856 .817
D Correct .829 .916 .999 .899 .864 .773
T Correct .914 .999 .826 .867 .877 .961


Appendix F

PDD Techniques for which no published studies could be located or for which no published
studies could be included in the meta-analysis.

● Backster Exploratory
● Backster SKY
● Backster ZCT
● IZCT Screening
● Law Enforcement Pre-Employment Test (LEPET)*
● Marcy Technique
● Matte Quinque Track
● Matte SGK
● RCMP B Series
● Searching Peak of Tension
● Utah MGQT*

* Although there is no published research to describe these techniques, the LEPET and Utah MGQT formats are
structurally nearly identical to the AFMGQT. Validation data for the AFMGQT can be generalized to these techniques
if they are evaluated with the TDA models described in the published studies.


Appendix G
Studies that could not be included in the meta-analysis

These studies were excluded for a variety of reasons, including the use of instrumentation that
does not match that used in the field, the use of test question sequences or test data analysis
models that do not match field testing protocols, non-representative study samples, or insufficient
statistical data to calculate the criterion accuracy profile or sampling distributions.

Ansley (1989) produced a study on the effectiveness of Relevant-Irrelevant screening exams
with student examiners. Crewson (2001) included the results of this study in a survey of
diagnostic and screening tests in the polygraph, medicine and psychology professions.
However, the study report is not available, and the study could not be included in the
meta-analysis.

Barland, Honts and Barger (1989) studied the accuracy of multi-issue screening exams. No
reliability data and no sampling means or standard deviations were available. Data are not
available for further analysis. Numerical scoring was completed according to older training
and field practice protocols from the U.S. Department of Defense and may not reflect
current practices. Without the ability to compare these results to the results from other
studies, this series of studies could not be included in the meta-analysis.

Bell, Kircher and Bernhardt (2008) compared the Utah PLC and DLC methods, and
concluded no significant differences exist. TDA was limited to automated methods, and
therefore was not included in the meta-analysis.

Brownlie, Johnson, and Knill (1997) completed a study on the Relevant-Irrelevant
technique. The study is described in the report from the NRC (2003) and Crewson (2001).
Crewson (2001) included the results of this study in a survey of diagnostic and screening
tests in the polygraph, medicine and psychology professions. However, the study report is
not available, and the study could not be included in the meta-analysis.

Correa and Adams (1981) reported the results of a study of the RI technique. The testing
procedure does not match field practice, and includes the use of a thermistor respiration
sensor instead of standard thoracic and abdominal pneumograph sensors. In addition, an
EKG was used in place of the cardiograph sensors. Results, as reported, were 100%
accuracy. No data could be obtained for review.

Forman and McCauley (1986) reported the results of a study on the positive-control
technique. Reported results did not provide mean or standard deviation statistics for the
distributions of scores. Additionally, the study employed a quasi-numerical scoring system
that does not reflect PCT field practice as described in other studies.

Ganguly, Lahri and Bhaseen (1986) reported the results of a study based on the Reid
Technique. However, testing procedures do not reflect field practices in that only the
pneumograph and cardiograph data were evaluated during the study.

Ginton, Daie, Elaad and Ben-Shakhar (1982) reported the results of an interesting field
study involving a unique modification of the MGQT technique, using an overall truth
question at position 3. This study was not included because this question sequence does
not reflect field practices.

Gordon, Fleisher, Morsie, Habib, and Salah (2000) reported the results of a field study of
the IZCT, including 309 reported confirmed cases and 1 error. No reliability data was
included in the publication, nor any statistical description of the sampling distributions of
deceptive and truthful scores. The authors advised that the data belong to the intelligence
service of a foreign government, and the primary author informed the ad hoc committee
(personal communication June 10, 2011) that he completed the study report without ever
seeing the data.

Honts and Amato (1999) reported the results of an automated presentation of the Relevant-
Irrelevant polygraph technique. This procedure does not reflect field practices, and the
information could therefore not be included in the meta-analysis.

Honts and Hodes (1983) reported the same results as experiment 1 in the Honts, Hodes
and Raskin (1985) study, which did not include a standard cardiovascular arm cuff.

Honts, Hodes and Raskin (1985) experiment 1 involved the Backster You-Phase technique,
but did not include a standard cardio cuff and therefore could not be included in the meta-
analysis.

Honts and Reavy (2009) used the Federal ZCT in a large-scale laboratory experiment.
However, this study was designed to evaluate and compare the effectiveness of PLCs and
DLCs. Data as reported could not be used to calculate a dimensional profile of
generalizable criterion accuracy estimates.

Horowitz, Kircher, Honts and Raskin (1997) included the results of a study of the RI
technique. RI examination results could not be included because the scoring protocol did
not reflect field practices (global analysis and subjective evaluation of consistent and
significant responses), and involved the use of seven-position numerical scoring procedures
in which the RQ responses were compared with the responses to neutral questions.

Horvath (1988) reported the results of a study on the Reid technique, involving two blind
evaluators who used a 7-position scoring model described as similar to the one used by
Barland and Raskin (1975) with +/- 5 decision thresholds. This is not the Reid scoring
method, which has been described as a three-position TDA model that does not employ
fixed cutscores. They also included a fifth RQ, which is not consistent with current field
practices.

Horvath and Palmatier (2008) reported the results of an MGQT format structured like the
Reid technique. The study involved two blind evaluators who used a 7-position scoring
model described as similar to the one used by Barland and Raskin (1975) with +/- 6
decision thresholds. This is not the Reid scoring method, which has been described as a
three-position TDA model that does not employ fixed cutscores. They also included a fifth
RQ, which is not consistent with field practices. Additionally, each case was scored by only
one evaluator, so reliability statistics could not be calculated.

Horvath and Reid (1971) reported the results of a study on the Reid technique. The study
could not be included in the meta-analysis for several reasons, including a highly selective
sample and the lack of a clearly structured decision model. Of the original 75 polygraph
exams, 40 were chosen by the author for inclusion in the study sample, and 35 exams were
removed from the sample because the author felt they were too easy to score. Selecting out
exams that the author felt were not appropriate for the study calls into question whether
the resulting sample is representative of real-world cases. The examiners were precluded
from making an inconclusive call, and were
required to make a DI or NDI call for every case. Some cases had five RQs, which is
inconsistent with field testing techniques currently used. No mean and standard deviation
scores were provided and no data could be provided to calculate them. Additionally, no
interrater reliability statistics were reported.

Polygraph, 2011, 40(4) 292


Ad Hoc Committee on Validated Techniques

Hunter and Ash (1973) reported the results of a study of the Reid technique. The Reid
technique does not employ a structured decision model or fixed cutscores, but relies on
impressionistic decisions from the examiner. The report does not include any inter-rater
reliability information and does not include any sampling distribution data that can be
used to compare the sampling distributions. Additionally, the study sample was
constructed using a highly selective process involving verified cases conducted by the
primary author. Without data on reliability and without sampling distribution parameters,
or access to the examination data, the sample is of unverifiable representativeness and
generalizability.

Jayne (1989) reported the results of a field study on the predictive value of polygraph
screening tests. The study design was intended to compare screening polygraph accuracy
with that of other preemployment screening methods. In addition to highly selective and
non-random case selection requirements, the criterion status of the sample screening cases
was determined as a function of the results of a subsequent diagnostic polygraph regarding
an employee theft investigation that was independent of the screening polygraph. Although
an innovative attempt to study future outcomes related to polygraph screening, the results
of this study are not suitable for use as a criterion accuracy study.

Jayne (1990) reported the results of a study of the Reid technique. Although the results
were previously described by Krapohl (2006), this study could not be included in the
present meta-analysis because two different scoring approaches were used, neither of
which reflect field practices. One scoring method involved making precise linear
measurements of the test data. The other, more common, numerical scoring method was
completed while excluding the fourth RQ, which was described as a secondary RQ that
could also be scored as a CQ. A test for which the nature and purpose of the stimulus is
decided post-hoc lacks scientific rigor. Current field practices do not endorse the exclusion
of individual RQs or the use of a secondary RQ as a CQ. In addition, no decision rules or
numerical cutscores were used in this study, and decisions were made according to the
subjective opinion of the scorer.

Krapohl (2005) reported the results of seven-position evidentiary scoring of Federal ZCT
exams. Study data included cases from four different archival samples for which the
sampling distributions could not be effectively compared to the sampling distributions from
other studies.

Krapohl (2010) showed that seven-position scores of You-Phase exams could be
transformed to ESS scores. The sample was a highly selective and non-representative set of
Backster You-Phase exams, reported by Meiron, Krapohl and Ashkenazi (2008).

Matte (1990) is an abstract of a dissertation study completed at an institution that was
approved by the State of California but not accredited by an institution recognized by the
U.S. Government or foreign equivalent. UMI later became part of ProQuest, which
subsequently adopted a policy that only dissertations from regionally accredited
universities would be listed and available to the public. The dissertation is maintained by
ProQuest with Matte as the author and ProQuest as the publisher. Data from this
dissertation study was included in this meta-analysis because the data was previously
published in the journal Polygraph (Matte & Ruess, 1989).

Matte (2010) reported the results of a study of the Backster Either-Or rule. However, no
truthful cases were included in the sample data. This study cannot be construed as a
criterion accuracy study, and could not be included in the meta-analysis.

Meiron, Krapohl and Ashkenazi (2008) reported the results of a study of the Backster
Either-Or rule. Data, as reported at the 2008 APA annual conference in Indianapolis,
included a highly selective and non-random sample from which the results of problematic
examinations were not included. The result of this form of highly selective sampling is that
the sample data are systematically devoid of error variance.

Patrick and Iacono (1991) reported the results of a study of the Federal ZCT while reviewing
the CQs between charts. Instrumentation recorded both skin conductance and heart rate,
which were scored against pre-stimulus levels in a manner that does not reflect field
practices.

Patrick and Iacono (1989), in a replication of an earlier study by Raskin and Hare (1978),
reported the results of a study involving a technique that resembles the Federal ZCT, using
the Utah scoring system and a grand total decision rule. This study is interesting but does
not resemble field practices closely enough to be included in the present meta-analysis of
criterion accuracy.

Podlesny and Raskin (1978) recorded eight different physiological channels using a test
question sequence that resembles the Federal ZCT with a guilt-complex question instead of
the symptomatic question at position 8. Numerical scoring was based on the method
described by Barland and Raskin (1975) and Raskin and Hare (1978) which is an early
version of the Utah numerical scoring system. Guilt complex questions and differences
between exclusive and non-exclusive CQs were studied. In addition, the CQT was
compared to the GKT. The absence of standard deviations prevents the comparison of the
sampling distribution of numerical scores with those from other studies. This study was
designed to evaluate a number of important questions, but could not fit into a clear
category of test question sequence and TDA model for which comparable replication studies
could be located. As a result, this study could not be included in the meta-analysis of
criterion accuracy of PDD techniques presently used in field settings.

Raskin and Hare (1978) reported the results of an important study using examinees who
were considered to be criminal psychopaths. Results from this study could not be included
in the meta-analysis criterion accuracy of field PDD methods because of the use of non-
standard testing instrumentation that did not include a standard blood-pressure cardio-
activity cuff sensor.

Research Division Staff (2001) described a study on the development of a laboratory
scenario for manipulating research subjects. This study was not designed to be a
criterion accuracy study.

Rovner (1986), in an interesting study of polygraph countermeasures, did not report
decisions, errors, and inconclusive results separately for truthful and deceptive cases. Data
are unavailable for further analysis. As a result, the complete dimensional profile of
criterion accuracy could not be calculated and the study could not be included as a
criterion accuracy study.

Senter and Dollins (2002) published an interesting and informative study of decision rules
that is not suitable for use as a criterion study.

Senter (2003) published an interesting and informative study of decision rules that is not
suitable for use as a criterion study.

Senter and Dollins (2004) published an interesting and informative study of decision rules
that is not suitable for use as a criterion study.

Senter and Dollins (2008) published an interesting and informative study of decision rules
that is not suitable for use as a criterion study.

Slowik and Buckley (1975) reported the results of a study of the Reid technique. The study
sample was selected from verified cases but does not describe the verification process. The
Reid technique does not employ a structured decision model or fixed cutscores, but relies
on impressionistic decisions from the examiner. The report does not include any inter-rater
reliability information and does not include any sampling distribution parameters that can
be used to compare the sampling distributions. Without sampling distribution statistics, or
access to the examination data, the sample is of unverifiable representativeness and
unknown generalizability.

Van Herk (1990) reported the results of a pilot study and what may be the first publication
on three-position TDA. One-third of the cases were unconfirmed, and consequently, this
study could not be included in the meta-analysis.

Wicklander and Hunter (1975) reported the results of a study of the Reid technique. The
study sample was selected from verified cases but does not describe the verification
process. The Reid technique does not employ a structured decision model or fixed
cutscores, but relies on impressionistic decisions from the examiner. The report does not
include any inter-rater reliability information and does not include any sampling
distribution data that can be used to compare the sampling distributions. Without
sampling distribution statistics, or access to the examination data, the sample is of
unverifiable representativeness and unknown generalizability.


Appendix H
Techniques for which there exists only one study that met the qualitative and quantitative
criteria for inclusion in the meta-analysis.

The APA Standards of Practice require a minimum of two published studies of a given technique.

Arther Technique

Horvath, F.S. (1977) reported the results of a field study using the Arther modification of the Reid
technique.

Reid Technique

Horvath (1988) reported the results of a laboratory study using the Reid technique.

These studies were previously combined by Krapohl (2006). However, subsequent review suggests
that the Reid and Arther techniques are sufficiently dissimilar that the results of these studies cannot
be regarded as replicating each other.

The Reid Technique differs from other CQT formats in some important ways. It sometimes employs
a fifth RQ while all other PDD techniques are limited to four RQs. Also, the Reid technique does
not employ fixed numerical cutscores or a structured decision model. Instead, decisions are made
impressionistically by the examiner, using information from the test data, pretest interview,
behavioral observations, and case file information. Inclusion of clinical impressions in the decision
process severely restricts the ability to validate it in the way that has been done with the other
techniques. Although the Reid technique can be credited as the source of many important
innovations, and is indeed the wellspring from which all other CQTs have evolved, it is currently
not being taught at any polygraph schools accredited by the APA. The number of practitioners
using the technique has decreased substantially since the closure of the Reid Polygraph School.
The majority of the other techniques in the meta-analysis are currently taught and/or practiced on
a considerable scale. A review of the research in support of the technique could not satisfy
the requirements of the study selection criteria, including interrater reliability statistics
and normative parameters with which to calculate the generalizability of sample data. Results
described in the published literature, and data available to the committee, do not permit the
statistical treatments applied to all of the other methods. Despite these limitations, the average
accuracy level of studies on the Reid technique was not significantly different from the results of
this meta-analysis.

Appendix I-1
AFMGQT / Three-position TDA

Two usable studies on the AFMGQT with three-position TDA produced a combined unweighted
average accuracy level of .816 (.059), with an inconclusive rate of .443 (.044). The weighted
average of correct decisions for the AFMGQT three-position model, was .989 (.016) for criterion
deceptive cases and .643 (.116) for criterion truthful cases. The weighted average inconclusive
rates were .237 (.060) for criterion deceptive cases and .648 (.067) for criterion truthful cases.
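
The unweighted figures above are simple means of the two per-study rates, while the weighted figures take study size into account. As an illustrative sketch only (the committee's published estimates also involve standard errors, which are not reproduced here), an n-weighted combination of per-study rates can be computed as follows; the example numbers are hypothetical, not the study data:

```python
def weighted_average(rates, ns):
    """Case-count-weighted average of per-study rates (illustrative sketch)."""
    return sum(r * n for r, n in zip(rates, ns)) / sum(ns)

# Hypothetical example: deceptive-case correct-decision rates of .99 and .95
# from studies contributing 50 and 11 criterion-deceptive cases.
unweighted = (0.99 + 0.95) / 2                       # simple mean = 0.97
weighted = weighted_average([0.99, 0.95], [50, 11])  # pulled toward the larger study
```

Because the larger study dominates the weighted estimate, the two summaries can differ noticeably when sample sizes are unequal.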

Study                        Nelson & Handler     Handler & Nelson
                             (In press)           (In press)
Sample N 100 22
N Deceptive 50 11
N Truthful 50 11
Scorers 1 3
D Scores 50 33
T Scores 50 33
Total Scores 100 66
Mean D -1.886 -1.903
StDev D 3.161 2.986
Mean T 2.427 1.424
StDev T 2.557 2.722
Reliability Kappa - -
Reliability Agreement - .945
Reliability Correlation - -
Unweighted Accuracy .869 .740
Unweighted Inconclusives .457 .421
Sensitivity .737 .780
Specificity .260 .180
FN Errors .007 .010
FP Errors .088 .186
D-INC .256 .209
T-INC .658 .633
PPV .894 .807
NPV .974 .947
D Correct .991 .987
T Correct .748 .492
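
The derived rows of the tables in these appendices follow a consistent arithmetic convention: PPV, NPV, and the unweighted figures weight the two criterion groups equally, and the correct-decision rows exclude inconclusive results. A minimal sketch of that arithmetic (an inference from the tabled values, not the committee's published code); applied to the first column above, it recovers the tabled PPV, NPV, and unweighted accuracy to within rounding:

```python
def dimensional_profile(sens, spec, fn, fp, d_inc, t_inc):
    """Derive the remaining table rows from the six base rates.

    Convention inferred from these tables: criterion groups are weighted
    equally, and correct-decision rates exclude inconclusive outcomes.
    """
    d_correct = sens / (sens + fn)   # deceptive decisions, INC excluded
    t_correct = spec / (spec + fp)   # truthful decisions, INC excluded
    return {
        'PPV': sens / (sens + fp),
        'NPV': spec / (spec + fn),
        'D Correct': d_correct,
        'T Correct': t_correct,
        'Unweighted Accuracy': (d_correct + t_correct) / 2,
        'Unweighted Inconclusives': (d_inc + t_inc) / 2,
    }

# First column above (Nelson & Handler, AFMGQT / three-position TDA):
p = dimensional_profile(sens=0.737, spec=0.260, fn=0.007, fp=0.088,
                        d_inc=0.256, t_inc=0.658)
```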


Appendix I-2
Army MGQT / Seven-position TDA

Two usable studies on the Army MGQT resulted in an unweighted accuracy level of .694 (.043),
with an inconclusive rate of .133 (.038). The weighted average of correct decisions for the Army
MGQT, using the seven-position TDA model, was .999 (.050) for criterion deceptive cases and .039
(.085) for criterion truthful cases. The weighted average inconclusive rates were .043 (.034) for
criterion deceptive cases and .224 (.065) for criterion truthful cases.

Study                        Krapohl & Norris     Blackwell
                             (2000)               (1999)
Sample N 32 100
N Deceptive 16 80
N Truthful 16 20
Scorers 3 3
D Scores 16 240
T Scores 16 60
Total Scores 32 300
Mean D - -
StDev D - -
Mean T - -
StDev T - -
Reliability Kappa - -
Reliability Agreement - -
Reliability Correlation .750 .907
Unweighted Accuracy .833 .660
Unweighted Inconclusives .219 .125
Sensitivity .813 .967
Specificity .500 .250
FN Errors .001 .001
FP Errors .250 .533
D-INC .188 .033
T-INC .250 .217
PPV .765 .644
NPV .999 .999
D Correct .999 .999
T Correct .667 .319


Appendix I-3
Directed-Lie Screening Test / Three-position TDA

Two usable studies on the DLST/TES with three-position TDA produced a combined unweighted
average accuracy level of .869 (.037) with an inconclusive rate of .228 (.043). The weighted average
of correct decisions for the DLST three-position model, was .827 (.060) for criterion deceptive cases
and .912 (.043) for criterion truthful cases. The weighted average inconclusive rates were .427
(.063) for criterion deceptive cases and .210 (.060) for criterion truthful cases.

Study                        Nelson, Handler,      Nelson
                             Blalock & Hernández   (In press)
                             (In press)
Sample N 100 49
N Deceptive 50 25
N Truthful 50 24
Scorers 1 2
D Scores 50 50
T Scores 50 48
Total Scores 100 98
Mean D -1.585 -1.458
StDev D 2.382 2.784
Mean T 1.719 2.470
StDev T 2.253 1.853
Reliability Kappa - -
Reliability Agreement - 0.762
Reliability Correlation - -
Unweighted Accuracy .893 .817
Unweighted Inconclusives .208 .248
Sensitivity .829 .415
Specificity .595 .848
FN Errors .033 .228
FP Errors .127 .009
D-INC .138 .355
T-INC .277 .141
PPV .867 .979
NPV .947 .788
D Correct .962 .645
T Correct .824 .989


Appendix I-4
Federal You-Phase / Three-position TDA

Two usable studies on the Federal You-Phase technique with three-position TDA produced a
combined unweighted average accuracy level of .881 (.041), with an inconclusive rate of .282
(.046). The weighted average of correct decisions for the Federal You-Phase three-position model,
was .977 (.023) for criterion deceptive cases and .786 (.078) for criterion truthful cases. The
weighted average inconclusive rates were .181 (.055) for criterion deceptive cases and .348 (.072)
for criterion truthful cases.

Study                        Nelson, Handler,      Nelson
                             Blalock & Cushman     (In press)
                             (In press)
Sample N 100 22
N Deceptive 50 11
N Truthful 50 11
Scorers 1 10
D Scores 50 110
T Scores 50 110
Total Scores 100 220
Mean D -4.720 -5.394
StDev D 3.345 4.219
Mean T 3.901 3.982
StDev T 3.804 4.432
Reliability Kappa - -
Reliability Agreement - 0.877
Reliability Correlation - -
Unweighted Accuracy .889 .878
Unweighted Inconclusives .387 .235
Sensitivity .740 .826
Specificity .380 .530
FN Errors .001 .027
FP Errors .107 .143
D-INC .260 .145
T-INC .513 .325
PPV .874 .852
NPV .997 .952
D Correct .999 .968
T Correct .780 .788


Appendix I-5
Federal ZCT / Three-position TDA (+/-6)

Two usable studies on the Federal ZCT with three-position TDA, using traditional cutscores (+/-6
and -3) produced a combined unweighted average accuracy level of .883 (.048), with an
inconclusive rate of .318 (.043). The weighted average of correct decisions for the Federal ZCT
three-position model with traditional cutscores, was .990 (.016) for criterion deceptive cases and
.675 (.095) for criterion truthful cases. The weighted average inconclusive rates were .158 (.050)
for criterion deceptive cases and .477 (.069) for criterion truthful cases.

Study                        Capps & Ansley       Blackwell
                             (1992)               (1998)
Sample N 100 100
N Deceptive 52 65
N Truthful 48 35
Scorers 1 3
D Scores 52 195
T Scores 48 105
Total Scores 100 300
Mean D -9.640 -5.938
StDev D 5.146 5.503
Mean T 4.780 4.867
StDev T 5.461 5.580
Reliability Kappa - .360
Reliability Agreement - .660
Reliability Correlation - -
Unweighted Accuracy .977 .779
Unweighted Inconclusives .386 .293
Sensitivity .769 .851
Specificity .438 .314
FN Errors .001 .010
FP Errors .021 .238
D-INC .231 .138
T-INC .542 .448
PPV .974 .781
NPV .999 .968
D Correct .999 .988
T Correct .955 .569


Appendix I-6
Federal ZCT / Three-position TDA (+/-4)

Two usable studies on the Federal ZCT with three-position TDA, using improved cutscores (+/-4
and -3) produced a combined unweighted average accuracy level of .939 (.028), with an
inconclusive rate of .269 (.044). The weighted average of correct decisions for the Federal ZCT
three-position model with improved cutscores, was .932 (.042) for criterion deceptive cases and
.946 (.035) for criterion truthful cases. The weighted average inconclusive rates were .314 (.066)
for criterion deceptive cases and .225 (.059) for criterion truthful cases.
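
The traditional (+/-6 and -3) and improved (+/-4 and -3) cutscores describe thresholds on the grand total and the individual question subtotals. One plausible reading of that rule, sketched below, applies the grand-total cutscores first and the -3 spot score only to otherwise inconclusive totals; this ordering is an assumption for illustration, not an authoritative statement of federal TDA procedure:

```python
def federal_zct_decision(spot_totals, total_cut=6, spot_cut=-3):
    """Grand-total decision with a secondary spot rule (illustrative sketch).

    spot_totals: per-relevant-question subtotals from numerical scoring.
    total_cut=6 reflects the traditional cutscores; pass total_cut=4 for
    the improved cutscores described above.
    """
    grand_total = sum(spot_totals)
    if grand_total <= -total_cut:
        return 'DI'          # deception indicated
    if grand_total >= total_cut:
        return 'NDI'         # no deception indicated
    if min(spot_totals) <= spot_cut:
        return 'DI'          # spot rule fires on an otherwise INC total
    return 'INC'

federal_zct_decision([-2, -4, -1])            # grand total -7 -> 'DI'
federal_zct_decision([3, 2, 2])               # grand total +7 -> 'NDI'
federal_zct_decision([0, -3, 1])              # spot rule      -> 'DI'
federal_zct_decision([2, 1, 1], total_cut=4)  # improved cut   -> 'NDI'
```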

Study Krapohl (1998) Harwell (2000)


Sample N 100 88
N Deceptive 50 60
N Truthful 50 28
Scorers 5 3
D Scores 250 180
T Scores 250 84
Total Scores 500 264
Mean D -7.700 -6.194
StDev D 5.876 5.576
Mean T 6.300 5.113
StDev T 5.317 5.521
Reliability Kappa - -
Reliability Agreement - .990
Reliability Correlation .900 -
Unweighted Accuracy .929 .937
Unweighted Inconclusives .264 .296
Sensitivity .604 .689
Specificity .768 .631
FN Errors .068 .017
FP Errors .032 .071
D-INC .328 .294
T-INC .200 .298
PPV .950 .906
NPV .919 .974
D Correct .899 .976
T Correct .960 .898


Appendix I-7
Positive Control Technique / Seven-position TDA*

Two studies on the Positive Control Technique produced a combined unweighted average accuracy
level of .820 (.043), with an inconclusive rate of .292 (.045). The weighted average of correct
decisions for the Positive Control Technique, was .679 (.080) for criterion deceptive cases and .962
(.032) for criterion truthful cases. The weighted average inconclusive rates were .333 (.066) for
criterion deceptive cases and .250 (.063) for criterion truthful cases.

Study                        Driscoll, Honts &    Forman &
                             Jones (1987)         McCauley (1986)
Sample N 40 38
N Deceptive 20 22
N Truthful 20 16
Scorers 1 1
D Scores 20 22
T Scores 20 16
Total Scores 40 38
Mean D -2.000 -
StDev D 3.800 -
Mean T 6.600 -
StDev T 5.700 -
Reliability Kappa - -
Reliability Agreement - .800
Reliability Correlation .840 .800
Unweighted Accuracy .889 .777
Unweighted Inconclusives .450 .131
Sensitivity .350 .545
Specificity .650 .750
FN Errors .100 .318
FP Errors .001 .063
D-INC .550 .136
T-INC .350 .125
PPV .999 .897
NPV .867 .702
D Correct .778 .632
T Correct .999 .923

* Driscoll, Honts & Jones (1987) used seven-position numerical scoring. Forman & McCauley (1986)
used a non-numerical TDA approach, and their study is included here only for comparison purposes.


Appendix I-8
Relevant-Irrelevant Technique

One usable published study exists for the Relevant-Irrelevant technique (Krapohl, Senter & Stern,
2005), which resulted in an unweighted accuracy rate of .732 (.044). Krapohl (2006) previously
reported the accuracy of the RI technique at .83 with zero inconclusives, while including the
results of Correa and Adams (1981) who reported an accuracy level of 100% in a laboratory study
that employed methodology that does not reflect field practices.

Study                        Krapohl, Senter &
                             Stern (2005)
Sample N 100
N Deceptive 59
N Truthful 41
Scorers 1
D Scores 59
T Scores 41
Total Scores 100
Mean D -
StDev D -
Mean T -
StDev T -
Reliability Kappa -
Reliability Agreement .700
Reliability Correlation -
Unweighted Accuracy .732
Unweighted Inconclusives .001
Sensitivity .831
Specificity .634
FN Errors .169
FP Errors .366
D-INC .001
T-INC .001
PPV .694
NPV .789
D Correct .831
T Correct .634
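
The PPV and NPV tabled above (as throughout these appendices) weight the two criterion groups equally, which corresponds to a 50% base rate of deception. In field settings the predictive value of a result shifts with the base rate. A hedged sketch using Bayes' rule with the tabled sensitivity (.831) and specificity (.634), ignoring inconclusive outcomes:

```python
def predictive_values(sens, spec, base_rate):
    """PPV and NPV as a function of the deceptive base rate (Bayes' rule).

    Illustrative sketch only; inconclusive outcomes are ignored.
    """
    p_di = sens * base_rate + (1 - spec) * (1 - base_rate)  # P(DI call)
    ppv = sens * base_rate / p_di
    npv = spec * (1 - base_rate) / (1 - p_di)
    return ppv, npv

# At a 50% base rate this recovers the tabled PPV of about .694; at a 10%
# base rate the same sensitivity and specificity yield a PPV near .20.
ppv_50, npv_50 = predictive_values(0.831, 0.634, 0.50)
ppv_10, npv_10 = predictive_values(0.831, 0.634, 0.10)
```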


Appendix I-9
Zone Comparison Techniques / Rank Order Scoring System

Two usable studies on the ZCT with Rank Order TDA produced a combined unweighted average
accuracy level of .886 (.036), with an inconclusive rate of .213 (.042). The weighted average of
correct decisions for the Zone Comparison Technique with Rank Order TDA, was .885 (.050) for
criterion deceptive cases and .887 (.053) for criterion truthful cases. The weighted average
inconclusive rates were .020 (.061) for criterion deceptive cases and .225 (.057) for criterion
truthful cases.

Study                        Krapohl, Dutton &    Honts & Driscoll
                             Ryan (2001)          (1987)
Sample N 100 60
N Deceptive 50 30
N Truthful 50 30
Scorers 3 2
D Scores 50 30
T Scores 50 30
Total Scores 100 60
Mean D -24.650 -20.900
StDev D 18.970 8.660
Mean T 7.400 13.400
StDev T 17.060 8.660
Reliability Kappa - -
Reliability Agreement - -
Reliability Correlation - .930
Unweighted Accuracy .879 .906
Unweighted Inconclusives .150 .317
Sensitivity .720 .600
Specificity .720 .633
FN Errors .120 .033
FP Errors .080 .100
D-INC .100 .367
T-INC .200 .267
PPV .900 .857
NPV .857 .950
D Correct .857 .947
T Correct .900 .864



Attachment B: Steyn - polygraph report April 2019

Scientific Basis for Polygraph Testing


Raymond Nelson

Abstract
Published scientific literature is reviewed for comparison question polygraph testing and its
application to diagnostic and screening contexts. The review summarizes the literature for all
aspects of the testing procedure including the pretest interview, test data collection, test data
analysis, and a proffer of the physiological and psychological basis for polygraph testing. Polygraph
accuracy information is summarized for diagnostic and screening exams. Evidence is reviewed for
threats to polygraph accuracy and the contribution of polygraph results to incremental validity or
increased decision accuracy by professional consumers of polygraph test results. The polygraph is
described as a probabilistic and non-deterministic test, involving both physiological recording and
statistical methods. Probabilistic tests, statistical models, and scientific tests in general are needed
when neither deterministic observation nor physical measurement is possible. Event-specific
diagnostic polygraphs have been shown to provide mean accuracy of .89 with a 95% confidence
range from .83 to .95. Multi-issue screening polygraphs have been shown to provide a mean
accuracy of .85, with a 95% confidence range from .77 to .93.
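
The confidence ranges quoted above can be approximated for an accuracy proportion with a normal-approximation interval. The sketch below is illustrative only: the sample size of 100 is hypothetical, and the cited surveys computed their intervals from their own study data:

```python
import math

def proportion_ci(p, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion p
    observed over n independent cases (illustrative sketch)."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# Hypothetical: mean accuracy .89 observed over 100 cases gives an
# interval of roughly .83 to .95.
lo, hi = proportion_ci(0.89, 100)
```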

Keywords: Polygraph, lie detection, signal detection, test data analysis, scientific basis.

Polygraph examinations, like other scientific and forensic tests, can take the form of either
diagnostic tests or screening tests. The difference between diagnostic and screening exams is that
diagnostic examinations involve the existence of a known problem, in the form of symptoms,
evidence, allegations, or incidental circumstances that suggest an individual may have some
involvement, for which the examination results are intended to support a positive or negative
diagnostic conclusion. Screening tests include all tests conducted in the absence of a known
incident, known allegation, or known problem.

The purpose of diagnostic tests is to form a conclusion that may serve as a basis for action. This
action will often affect the future of an individual in terms of rights, liberties or health. For this
reason, it is difficult to imagine an ethical justification for the selection of a testing technique that
provides something less than the highest achievable level of diagnostic accuracy. Diagnostic tests
achieve high levels of decision accuracy, in part, by restricting the test to a single issue of concern.

In contrast, screening tests are intended to add incremental validity to risk management decisions
that are made in the absence of any known problem. This is accomplished both by gathering
information and by investigating the possible involvement of an individual in one or more issues of
concern. Screening tests should not be used alone as the basis for action that may affect an
individual's rights, liberties, or health. Absence of any known problem is the defining characteristic
of a screening test (Wilson & Jungner, 1968; Raffle & Muir Gray, 2007). Screening polygraph tests
address the objective of adding incrementally

Author note: Raymond Nelson is a polygraph examiner and psychotherapist who has authored numerous papers
and studies on many aspects of the polygraph test. Mr. Nelson is a researcher associated with the Lafayette
Instrument Company, is the curriculum director of an accredited polygraph training program, and is an elected
member of the Board of Directors of the American Polygraph Association - currently serving in the role of President.
Mr. Nelson has taught at several polygraph training programs, is an active speaker and presenter at both
national and international conferences, and has been an expert witness in several court jurisdictions regarding both polygraph
and psychotherapy matters. The views and opinions expressed in this publication are those of the author and not
necessarily those of the Lafayette Instrument Company or the American Polygraph Association. Inquiries and
correspondence can be sent to raymond.nelson@gmail.com.

28 Polygraph, 2015, 44(1)



to risk management through a combination that describe their possible involvement in


of smaller goals that may include both the problematic behaviors, while also sensitizing
discriminate ability of the test result, and the or increasing the awareness and response
capability of the testing process to develop potential of deceptive examinees to test
information. Polygraph screening programs questions that describe their past behavior.
can also involve a third goal in the form of
increased deterrence of problems (American Polygraph Association, 2009a; 2009b). Deterrence objectives may be achieved by deterring higher-risk persons from access to high-risk environments (e.g., police applicant screening or government/operational security screening), or through dissuasion of (or decreased) non-compliance with policies, rules, and regulations (e.g., operational security policies).

Discussion

According to the American Polygraph Association (2011), polygraph examinations consist of three phases: 1) a pretest interview, 2) an in-test data collection phase, and 3) test data analysis. Each of these phases has an important effect on both test accuracy and the usefulness of the test result. For this reason, all assumptions and procedures considered fundamental to the polygraph test should ideally be based on generally accepted knowledge or evidence and theoretical constructs for which there exists published and replicated empirical support.

Polygraph pretest interview

At its most basic, an interview is merely a conversation with a purpose (Hodgson, 1987), and, as indicated by Kahn and Cannel (1957), the success of many professional endeavors depends in part on the ability to get information from others. The polygraph pretest interview is intended to orient the examinee to the testing procedures, the purpose of the test, and the investigation target questions. The basic premise of interviewing holds that people will report more useful information when they are prompted to do so by an interested listener who builds rapport through the use of conversation and interview questions. Polygraph pretest interviews are intended to allow truthful examinees to become accustomed to - or habituated to - the cognitive and emotional impact of hearing and responding to test stimulus questions.

The polygraph pretest interview is a process involving several steps (American Polygraph Association, 2009a, 2009b; Department of Defense Polygraph Institute, 2002), including: a free-narrative interview (Powell & Snow, 2007), semi-structured interview (Lindlof & Taylor, 2002), or structured interview (Drever, 1995), a thorough review of the test question stimuli, and a practice or orientation test. The first objective of the pretest interview is to establish a positive identification and introduction, and to clarify the roles of the examiner and examinee. The examiner will also introduce the examinee to the examination room, including the use of audio or video recording devices, and all of the polygraph sensors that will later be attached to the examinee.

The next stage of the process consists of making an initial determination of the suitability of the examinee and obtaining informed consent for testing. This is done after a review of the rights of the examinee during testing, including the right to terminate the examination at any time. Ideally, informed consent should also include information about who will receive the information and results from the examination, and where to obtain more information about the strengths and limitations of the polygraph procedure. The examiner will then engage the examinee in a brief discussion about the case background and personal background of the examinee, in order to continue to establish an adequate and suitable testing rapport. The examiner will also provide more information about the psychological and physiological basis for the polygraph test, and will provide answers to any questions the examinee may have regarding the testing procedures.

A practice test or acquaintance test should be conducted as part of standardized field practice (American Polygraph Association, 2009a, 2009b; Department of Defense, 2006a). The purpose of this test is to

Polygraph, 2015, 44(1) 29


Nelson

orient the examinee to the testing procedure before commencing the actual examination. Research by Kircher, Packard, Bell & Bernhardt (2001) has shown that this can contribute to increased test accuracy. A scientific view, supported by recent studies involving non-naive examinees who are fully aware of the details of the testing procedures (Honts & Reavy, 2009; Honts & Alloway, 2007; Nelson, Handler, Blalock & Hernandez, 2012; Rovner, 1986), holds that the effectiveness of evidence-based scientific tests is not dependent on the examinee's belief system. The purpose of the acquaintance test is not to demonstrate or convince the examinee to believe that the polygraph test is infallible, but to orient the examinee to the testing procedures. Regardless of the examiner's and examinee's attitudes or beliefs concerning the acquaintance test, scientific studies (Bradley & Janisse, 1981; Horneman & O'Gorman, 1985; Horowitz, Kircher & Raskin, 1986; Kirby, 1981; Widup Jr. & Barland, 1994) have shown that the use of an acquaintance test does not harm and may at times increase the accuracy of the polygraph examination result. The actual reason for this effect may have more to do with ensuring that the instrument and sensors are adjusted and functioning adequately and that the examinee has had an opportunity to practice complying with behavioral instructions.

The next stage of the pretest interview will be a free-narrative interview, a structured interview or semi-structured interview. Free-narrative interviews are characterized by the use of simple and common language, an absence of coercive techniques, an opportunity for the interviewee to communicate details at the level of one's own choosing, along with encouragement to elaborate. Free-narrative interviews conducted during polygraph testing may include direct or probing questions regarding a known or alleged incident, before proceeding to construct polygraph test questions. Free-narrative interview strategies are useful during diagnostic investigations, but are not well suited to polygraph screening tests, which are conducted in the absence of a known or alleged incident. Pretest interviews for polygraph screening exams, whether pertaining to operational security, law enforcement pre-employment, or post-conviction supervision, will take the form of either a structured interview or semi-structured interview.

Structured interviews differ from semi-structured interviews in that structured interviews are conducted verbatim, without deviation from the interview protocol (General Accounting Office, 1991; Campion, Campion & Hudson, 1994; Kvale, 1996). In contrast, semi-structured interviews are conducted using a structured content and question outline, for which the interviewer is permitted to present interview questions in a manner that is individualized based on the personalities, education levels, and rapport between the interviewer and interviewee. Although structured interviews may be preferred by some researchers and program administrators for their consistency, structured interviews make little use of the skill and expertise of the interviewer.

Semi-structured interviews are intended to make more effective use of interviewer skill and expertise to access rich information regarding the interview content. Like structured interviews, semi-structured interviews should be anchored by a defined interview schedule or interview protocol, with clearly formulated operational definitions that describe the behavioral issues of concern. Compared to structured interview methods, semi-structured interview strategies both depend on and foster greater interviewing skill. Like structured interview methods, semi-structured interview protocols require that all interview topics and questions are addressed at some point during an interview.

In the last stage of the pretest interview – following the free-narrative interview or semi-structured interview – the examiner will develop and review the test questions with the examinee (American Polygraph Association, 2009a, 2009b; Department of Defense, 2002). Test question language will be adjusted to ensure correct understanding and to account for information or admissions that the examinee may provide during the interview or while developing the test questions. Relevant questions will describe the possible behavioral involvement

Scientific Basis of Polygraph
of the examinee in the issue or issues of concern. These questions will generally avoid issues related to memory, intent, and motivation. However, some investigative testing protocols will allow for test questions regarding memory if an examinee admits the alleged behavioral act and the issue of memory or motivation is the target of the investigation (American Polygraph Association, 2009a).

When a polygraph examination consists of multiple series of test questions, the examiner will review each series of questions separately, then conduct the in-test data collection phase for each question series before reviewing and collecting data for each subsequent question series. When a polygraph consists of multiple series of test questions, there is no evaluation or discussion of the results of any individual series of questions until all test question series have been fully recorded and analyzed. If an acquaintance test was not conducted earlier, it may be conducted after reviewing the test questions and before proceeding to the in-test phase of the exam. Some earlier polygraph testing formats employed a procedure analogous to an acquaintance test, though following the first presentation of the test stimuli during the in-test phase of the exam.

In-test data collection

The second phase of the polygraph examination is that of in-test data collection. This may be accomplished using any of a variety of validated diagnostic or screening test formats (American Polygraph Association, 2011b; Department of Defense, 2002). All screening and diagnostic polygraph techniques include relevant questions (RQs) that describe the examinee's possible involvement in the behavioral issues under investigation. Effective relevant questions will be simple, direct, and should avoid legal or clinical jargon and words for which the correct meaning may be ambiguous, confusing or not recognizable to persons unfamiliar with legal or professional vocabulary. Each relevant question must address a single behavioral issue.

Relevant questions of event-specific diagnostic polygraphs are constructed with the assumption of non-independent criterion variance. The scientific and probabilistic meaning of this is that the RQs have a common or shared source of response variance because the external criterion states of different RQs may (and do) affect one another. The practical meaning of this is that all RQs must address behavior within a single incident of concern.

Multi-issue screening polygraphs, conducted in the absence of a known allegation or incident, may be constructed with relevant questions that describe distinct behaviors for which the external criterion states are assumed to vary independently (i.e., external criterion states are assumed to be exclusive or not interact and affect one another). There is evidence that response variance for these questions is not actually independent (Barland, Honts & Barger, 1989; Podlesny & Truslow, 1993; Raskin, Honts & Kircher, 2014), and for this reason field practices do not permit both positive and negative test results within a single examination. Regardless of whether conducted for diagnostic or screening purposes, all polygraph examinations are ultimately interpreted at the level of the test as a whole, though subtotal scores for individual RQs may be evaluated according to standardized procedures.

Most polygraph examinations in the United States today are conducted with some variant of the comparison question technique (CQT). The CQT was first described in publication by Summers (1939) while he was head of the Psychology Department at the Fordham University Graduate School in New York. The CQT was popularized within the polygraph profession by Reid (1947) and Backster (1963). It is the most commonly used and exhaustively researched family of polygraph techniques in use today. In addition to RQs, these polygraph techniques also include comparison questions (CQs; referred to in earlier polygraph literature as control questions). When scoring a test, examiners will numerically and statistically evaluate differences in responses to RQs and CQs.

The traditional form of comparison


question is the probable-lie comparison (PLC) question, while some evidence-based contemporary CQTs make use of the directed-lie comparison (DLC). Examiners who use PLCs will maneuver the examinee into denying a common behavioral issue that is not the target of the examination. Probable-lie comparison questions have been the basis of some criticism of the polygraph technique due to their manipulative nature, and also the uncertainty surrounding the veracity of the examinee regarding these questions (Furedy, 1989; Lykken, 1981; Office of Technology Assessment, 1983; Saxe, 1991). Some of these criticisms rest on an inaccurate assumption that the polygraph measures actual lies per se. The polygraph, like many scientific tests, records responses to stimuli. Polygraph instruments do not actually measure lies, but instead discriminate deception and truth-telling through the use of probability models and statistical reference data that describe the differences in the patterns of reactions of truthful and deceptive persons when responding to RQs and CQs.

Although not a control in the strictest sense, CQs serve a similar function as a control in that they allow an examiner to effectively parse and compare diagnostic variance and other sources of variance. Response variance of CQs is not completely independent of the investigation target issue in the same way that scientific controls are - because responses for CQs and RQs are from the same examinee. This model of testing can be thought of as analogous to the way that data are acquired from the subjects in a two-way repeated-measures ANOVA design - in which each subject serves as his or her own control set. In this way, each polygraph examination serves as a form of single-subject scientific experiment.

Directed-lie comparison (DLC) questions have been introduced as an alternative to the use of PLCs (Barland, 1981; Research Division Staff, 1995a; 1995b). DLCs are used in polygraph techniques developed by the United States (U.S.) government for use in polygraph screening programs, and in diagnostic polygraph techniques developed by researchers at the University of Utah (Honts & Raskin, 1988; Kircher, Honts & Raskin, 1997) and at the U.S. Department of Defense (Honts & Reavy, 2009). The major difference between PLC and DLC techniques is that DLC techniques are transparent and can be used without the need to maneuver or manipulate the examinee into denying a common behavioral issue.

DLCs have been shown in numerous studies, summarized by Blalock, Nelson, Handler and Shaw (2011; 2012), to perform classification tasks with equal efficiency and similar statistical distributions of numerical scores (American Polygraph Association, 2011) compared to PLC exams. Some researchers have suggested that DLCs are less ethically complicated than PLCs because they do not require the examiner to psychologically manipulate the examinee (Honts & Raskin, 1988; Honts & Reavy, 2009; Horowitz et al., 1997; Kircher, Packard, Bell, & Bernhardt, 2001; Raskin & Kircher, 1990). DLC examinations have also been shown to retain effectiveness in different languages and cultures (Nelson, Handler & Morgan, 2012).

In addition to PLC and DLC questions, other variants of comparison questions have been suggested and argued, including exclusive comparison questions and non-exclusive (i.e., inclusive) comparison questions. Studies have shown all of these CQ variants to perform with similar effectiveness, for which accuracy does not differ at a statistically significant level (Amsel, 1999; Honts & Reavy, 2009; Horvath & Palmatier, 2008; Horvath, 1988; Palmatier, 1991). A recent meta-analytic survey (American Polygraph Association, 2011b) has further solidified this conclusion, in demonstrating that the same polygraph techniques perform with equivalent effectiveness and no significant differences in the sampling distributions of criterion deceptive and criterion truthful scores, when the techniques are employed with PLC or DLC questions. Scientific assumptions underlying scoring models for PLC and DLC techniques are the same, assuming only that examinees will respond differently to relevant and comparison stimuli as a function of deception in response to RQs.

All polygraph techniques may include other procedural questions that are not


numerically scored. Procedural questions designed for other technical examination purposes have not been supported by scientific studies, including: overall truth questions (Abrams, 1984; Hilliard, 1979), outside issue questions - referred to also as “symptomatic” questions - that attempt to inquire about interference from outside the scope of the examination questions (Honts, Amato & Gordon, 2004; Krapohl & Ryan, 2001), guilt-complex questions (Podlesny, Raskin & Barland, 1976), and sacrifice relevant questions regarding an examinee's intent to answer truthfully (Capps, 1991; Horvath, 1994). The absence of evidence to support their validity has led to the abandonment of the use of most of these technical questions.

Only two non-scored technical questions remain widely used today, and these are used only in a procedural sense. Though not numerically scored, and not included in structured or statistical decision models, outside issue questions have been retained as a structural and procedural part of some test formats. Likewise, sacrifice relevant questions are valued for the purported purpose of absorbing and discarding the examinee's initial response to the first question that describes the investigation target issue. These questions are also not numerically scored and not included in structured or statistical decision models. Un-scored sacrifice questions are included in virtually all modern polygraph techniques in use today.

A basic principle of measurement and testing is to obtain several measurements for each issue of concern. This is accomplished during polygraph testing by the use of several component sensors, each of which is designed to monitor increases or changes in activity in the autonomic nervous system, and by the standard practice of aggregating or combining the responses to several presentations of each test stimulus (Bell, Raskin, Honts, & Kircher, 1999; Kircher & Raskin, 1988; Raskin, Kircher, Honts, & Horowitz, 1988; Reid, 1947; Research Division Staff, 1995a; 1995b). Field polygraph procedures (Handler & Nelson, 2008; Kircher & Raskin, 1988; Department of Defense, 2006a) require that test stimuli are presented a minimum of three times and as many as five times. The common method is to repeat the entire series of test questions, while pausing the recording and deflating the cardio sensor in between repetitions. Some examination protocols (Department of Defense, 1995a, 1995b; Handler, Nelson & Blalock, 2008) achieve several repetitions of the test questions without pausing the examination.

Test data analysis – scoring of polygraph examinations

Prior to informing the examinee or others of the results of the polygraph examination, the examiner must analyze the test data. Procedures for test data analysis are designed to partition and compare the sources of response variance: variance in response to RQs and variance in response to CQs. Responses are numerically coded and the result is compared to cutscores that represent normative expectations for deceptive or truthful persons. The overarching theory of polygraph testing is that responses to RQs and CQs vary significantly as a function of deception and truth-telling in response to the RQs.

The basic premise of numerical scoring of polygraph exams was first described by Kubis (1962), using a method similar to that of Likert (1932), who showed how to reduce subjectivity using numerical coding of ordinal non-linear response data. Numerical scoring was popularized within the polygraph profession by Backster (1963) as the seven-position scoring system, and has been subject to further development and refinement through empirical study. The procedural construct for evaluation of differences in reaction to RQs and CQs can be traced to Summers (1939), who used a question sequence consisting of three relevant target questions and three comparative response questions repeated three times. Resulting data are found to produce different distributions for the different criterion groups, and these distributions can be used to classify other case observations. Variants of this model are observed in both signal detection and signal discrimination theory (Wickens, 1991; 2002).
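As a rough illustration of this numerical-coding premise, the sketch below assigns a hypothetical integer score to each RQ presentation by comparing RQ and CQ reaction magnitudes, then aggregates scores across a Summers-style sequence of three RQs repeated three times. The function names, threshold, and reaction values are invented for illustration only; this is a teaching sketch, not a published field scoring method such as the seven-position system or the Empirical Scoring System.

```python
# Illustrative sketch only: hypothetical values and thresholds, not a
# published field scoring algorithm.

def code_presentation(rq_reaction, cq_reaction, threshold=0.1):
    """Assign an integer score for one RQ presentation by comparing the
    relative magnitudes of reaction to the RQ and its paired CQ.
    Larger CQ reaction -> positive (truthful-direction) score;
    larger RQ reaction -> negative (deceptive-direction) score."""
    diff = cq_reaction - rq_reaction
    if diff > threshold:
        return 1
    if diff < -threshold:
        return -1
    return 0

def grand_total(series):
    """Sum scores over all RQs and all repetitions of the sequence.
    `series` is a list of repetitions; each repetition is a list of
    (rq_reaction, cq_reaction) pairs, one pair per RQ."""
    subtotals = {}
    for repetition in series:
        for i, (rq, cq) in enumerate(repetition):
            subtotals[i] = subtotals.get(i, 0) + code_presentation(rq, cq)
    return sum(subtotals.values()), subtotals

# Three RQs paired with CQs, repeated three times (a Summers-style
# sequence), using made-up reaction magnitudes in arbitrary units.
series = [
    [(0.9, 0.4), (0.8, 0.5), (0.7, 0.3)],
    [(0.8, 0.4), (0.9, 0.6), (0.6, 0.4)],
    [(0.7, 0.3), (0.8, 0.4), (0.9, 0.5)],
]
total, subtotals = grand_total(series)
print(total, subtotals)
```

Under these invented inputs, every RQ reaction exceeds its paired CQ reaction, so each presentation codes negative and the aggregated scores fall in the deceptive direction; classification would then follow from comparing such totals to the distributions of scores observed for the criterion groups.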


There are three commonly used variants of the seven-position scoring system in use today, including the model developed by the U.S. Government (Department of Defense, 2006a; 2006b), the model published by ASTM International (2002), and the one developed by researchers from the University of Utah (Bell, Raskin, Honts, & Kircher, 1999; Kircher & Raskin, 1988; Raskin & Hare, 1978) and described by Handler (2006) and Handler and Nelson (2008). Differences between these seven-position methods are procedural and may be inconsequential in terms of test accuracy (American Polygraph Association, 2011b).

A commonly used modification of the seven-position scoring model is the three-position system defined by the Department of Defense (2006a, 2006b). Three-position scoring models are favored by some examiners for their simplicity and reliability, though there is a known increase in inconclusive test results (American Polygraph Association, 2011b) when using numerical cutscores intended for seven-position scores. The Empirical Scoring System (Nelson, Krapohl & Handler, 2008; Nelson & Handler, 2010; Nelson et al., 2011) is a statistically referenced, standardized, and evidence-based modification of the three-position and seven-position scoring models, with test accuracy comparable to other scoring models, though without the increase in inconclusive results.

As a theoretical matter, scoring of polygraph examinations is not different from the evaluation of other scientific tests in medicine, psychology, and forensics, and involves four basic concerns: 1) the identification of observable or measurable criteria (i.e., scoring features), 2) transformation of scoring features to numerical values, and reduction of numerical values to a grand total index for the examination as a whole and subtotal indices for the individual examination items, 3) statistical reference distributions to calculate statistical classifiers and numerical cutscores, and 4) structured decision policies.

Discussions of polygraph scoring methods are inseparable from discussions of test theory, including both decision theory and signal detection theory. Decision theory (Greenberg, 1982; Lehmann, 1950; Lieblich, Ben-Shakhar, Kugelmass, & Cohen, 1978; Pratt, Raiffa, & Schlaifer, 1995; Wald, 1939), like statistical learning theory (Hastie, Tibshirani & Friedman, 2001), is concerned with making optimal decisions. Signal detection theory is concerned with identifying and separating useful information from background noise or random information (Green & Swets, 1966; Marcum, 1947; Schonhoff & Giordano, 2006; Swets, 1964; Swets, 1996; Tanner, Wilson, & Swets, 1954). Signal detection theory includes two fundamental models: signal detection (e.g., Yes or No) (Wickens, 2002) and signal discrimination models (e.g., A or B) (Wickens, 1991). CQT polygraph testing represents the second of these two forms, in that polygraph decisions attempt to achieve diagnostic accuracy when placing or predicting individual examinee membership into criterion categories of deception and truth-telling. Recent efforts (Nelson, Krapohl & Handler, 2008; Nelson et al., 2011; Nelson & Handler, 2010) have begun to make more extensive use of statistical decision theory to quantify the probability of erroneous polygraph test results.

Physiological reaction features. Scoring of polygraph examinations begins with the identification of observable or measurable physiological responses that are correlated with deception and which can be combined into an efficient and effective diagnostic model. A number of studies - largely funded by the U.S. Department of Defense and conducted at the University of Utah and Johns Hopkins University - have described investigations into the identification and extraction of polygraph scoring features (Bell, Raskin, Honts, & Kircher, 1999; Harris, Horner, & McQuarrie, 2000; Kircher, Kristjiansson, Gardner, & Webb, 2005; Kircher & Raskin, 1988; Raskin et al., 1988). These efforts are reflected in field practice standards published by the Department of Defense (2006a; 2006b), by ASTM International (2002), and in publications on the Empirical Scoring System (Nelson & Handler, 2010; Nelson et al., 2011).

A small number of physiological indicators have repeatedly been shown to be correlated with deception in structural


decision models presently used in field polygraph programs. They are: respiration – observed as the respiration line length (Kircher & Raskin, 1988), respiration excursion length (Kircher & Raskin, 2002), or as either sustained decreases in respiration amplitude for three or more respiratory cycles, slowing of respiration rate for three or more cycles, or temporary increases in respiratory baseline of three cycles or more, and apnea; electrodermal activity – observed or measured as an increase in skin conductance (decrease in resistance), increased duration of response, and multiple responses; and cardiovascular activity – in the form of increase in relative blood pressure, increased duration of response, slowing of heart rate, and decrease in finger blood-pulse volume.

Scoring features have been described as either primary or secondary (Bell et al., 1999; Department of Defense, 2006a; 2006b). Primary features are those that capture the greatest degree of variance in deceptive and truthful responses to RQs and CQs within each of the recorded physiological channels. Secondary features are correlated with differences in deceptive and truthful responses at statistically significant levels, but have weaker correlation coefficients compared to primary features. Also, secondary features provide information that is so strongly correlated with their primary counterparts that the added information is largely redundant and may not be additive to the effectiveness of some structural models. Some computerized scoring algorithms (Honts & Devitt, 1992; Kircher & Raskin, 1988; Krapohl, 2002; Krapohl & McManus, 1999; MacLaren & Krapohl, 2003; Nelson, Krapohl & Handler, 2008; Raskin et al., 1988) in use today, and the evidence-based Empirical Scoring System (Nelson & Handler, 2010; Nelson et al., 2011), have been designed to use only primary features, forgoing reaction features considered secondary in importance. Primary features, sometimes referred to as “Kircher features”, are the following: respiration – observed as excursion length or correlated patterns; electrodermal activity – observed as the amplitude of vertical increase; and cardiovascular activity – observed as the amplitude of vertical increase in relative blood pressure.

Numerical transformations. Numerical scores, in the form of non-parametric integers of positive or negative value, are assigned to each presentation of each RQ by comparing the strength of reaction to each RQ with the strength of reaction to the CQs presented in sequence with the RQs. A fundamental assumption during comparison question testing is that both truthful and deceptive examinees may exhibit some degree of reaction to relevant question stimuli. Indeed, Ansley (1999), Ansley and Krapohl (2000) and Offe and Offe (2007) have shown empirically that it is not the presence or absence of a response, nor the linear magnitude of response to the relevant questions, that discriminates deception from truth-telling. Instead, the simple relative magnitude or degree of response to CQs, relative to the degree of response to the RQs, is the differentiating characteristic between deceptive and truthful examinees.

Deceptive examinees generally exhibit larger magnitude of change in autonomic activity in response to relevant stimuli than comparison stimuli, while truthful examinees will generally exhibit larger magnitude of change to comparison stimuli than to relevant stimuli. Deceptive scores are assigned when the magnitude of change to relevant question stimuli is greater than to comparison question stimuli. Conversely, truthful scores are assigned whenever the degree of change in response to the comparison question stimuli is greater than responses to the relevant question stimuli. Numerous scientific reviews of countless scientific studies have affirmed the validity of the operational construct that responses to relevant and comparison stimuli vary as a function of deception or truth-telling regarding a past behavior (American Polygraph Association, 2011; Ansley, 1983; 1990; Abrams, 1973; 1977; 1989; National Research Council/National Academy of Science, 2003; Nelson & Handler, 2013; Office of Technology Assessment, 1983; Podlesny & Raskin, 1978; Raskin, Honts & Kircher, 2014).

Decision cutscores and reference distributions. Numerical test scores are translated into categorical test results


through the comparison of test scores with cutscores that are anchored to reference distributions that describe the statistical density or probability of obtaining each particular score within the range of possible test scores. Because all scientific data are a combination of both diagnostic variance (i.e., explained variance) and unexplained variance (i.e., error variance, random variance or uncontrolled variance), individual scores can be expected to vary somewhat within each examination. For this reason, aggregated test scores have been found to provide the greatest diagnostic efficiency. This is often in the form of a grand total score, though subtotal scores are also used with some polygraph techniques. Grand total and subtotal scores are compared to cutscores and statistical reference distributions to determine the likelihood that an observed test score has occurred due simply to uncontrolled error variance or random chance.

Probability cutscores are an expression of our tolerance for uncertainty or error, expressed as a statistical probability of error, often using the Greek letter α (alpha), declared prior to conducting an examination. A common probability cutscore in polygraph and other scientific disciplines is .05, with the goal of constraining the proportion of errors to 5% or less while attempting to provide a minimum confidence level of 95% for the categorical test result. Alternative probability boundaries of .10 and even .01, representing intended confidence levels of 90% and 99%, are sometimes used when testing objectives indicate a need for fewer inconclusive or unresolved test results (.10) or for fewer errors (.01).

Cutscores have also been determined using performance curves (Bell et al., 1999) and through heuristic experience. Regardless of the method used to determine numerical cutscores, all decision cutscores will have some associated statistical information to describe the level of significance or probability of error. The relationship between numerical scores and associated statistical reference distributions can be calculated mathematically, and can also be conveniently determined using published reference tables.

Reference distributions have been summarized (American Polygraph Association, 2011; Nelson & Handler, 2015) in the form of descriptive statistics that inform us about the location (i.e., mean or average), dispersion (i.e., variance or standard deviation) and shape of the distribution of scores observed in the sampling data for criterion deceptive and criterion truthful persons. Published reference distributions can be used to calculate the margin of uncertainty, in the form of a level of statistical significance, odds ratio, confidence level or probability of error, associated with any possible test score. Test results are said to be statistically significant when the probability of error is less than or equal to a declared probability cutscore or alpha level (i.e., p ≤ α). This is equivalent to the condition when a test score equals or exceeds a cutscore.

Decision rules. Decision rules are the practical and procedural comparison of test scores with either traditional cutscores (Bell et al., 1999; Department of Defense, 2006a; 2006b; Kircher & Raskin, 1988; Raskin et al., 1988) or cutscores selected for their level of statistical significance and probability of error using published reference distributions (American Polygraph Association, 2011; Krapohl & McManus, 1999; Krapohl, 2002; Nelson, Krapohl & Handler, 2008; Nelson & Handler, 2010; 2015; Nelson et al., 2011). Procedurally, following the assignment of numerical scores, all scores are aggregated by summing the subtotal scores for all presentations of each RQ stimulus. Subtotal scores are then summed to achieve a grand-total score for event-specific diagnostic exams. Procedural decision rules are constructed with regard for assumptions of independent and non-independent criterion variance of the RQs of event-specific diagnostic exams and multi-issue screening exams.

Procedural decision rules for event-specific diagnostic examinations, for which the criterion variance of the several RQs is assumed to be non-independent or dependent (i.e., all RQs address a single event for which the criterion status of different test stimuli may be strongly related), will make

36 Polygraph, 2015, 44(1)


Scientific Basis of Polygraph

use of the grand total score to make the most accurate classification possible regarding the test as a whole. Some decision rules, such as those used by the Department of Defense (2006a; 2006b), as described by Light (1999), or those developed by Senter and Dollins (2002), may also make use of subtotal scores in an attempt to reduce inconclusive results, increase test sensitivity to deception, or reduce false-negative errors. Decision rules involving the grand total, referred to herein as the grand-total-rule (GTR), are accomplished by comparing the grand total score to cutscores for deceptive and truthful classifications. When using the GTR, test results are statistically significant and a categorical conclusion is made if the grand total score equals or exceeds one of the cutscores.

A two-stage modification of the GTR, referred to herein as the two-stage-rule (TSR), was described by Senter (2003) and Senter and Dollins (2002; 2008). The TSR allows the use of subtotal scores to achieve a categorical conclusion when the grand-total score is not statistically significant (i.e., inconclusive). When used, subtotal scores should be compared with cutscores that are statistically corrected for the known inflation of alpha, and the associated potential increase in false-positive errors, that results from the use of multiple statistical comparisons (Abdi, 2007; Nelson and Handler, 2010; Nelson et al., 2011; Nelson, Krapohl & Handler, 2008). Use of a simple statistical correction, referred to as a Bonferroni correction, can prevent an increase in false-positive errors when using subtotal scores of event-specific diagnostic exams.

By definition, the criterion states of the RQs of multi-issue screening exams are assumed to vary independently. For this reason, grand total scores are generally not used with multi-issue screening exams, for which the subtotal scores are more commonly used. Scores for multi-issue screening polygraphs are commonly evaluated using the individual subtotal scores (Department of Defense, 2006a; 2006b; Nelson and Handler, 2010; Nelson et al., 2011; Nelson, Krapohl & Handler, 2008), using a rule referred to herein as the subtotal-score-rule (SSR). The SSR is executed by comparing each subtotal score to cutscores derived from statistical distributions of subtotal scores of examinations constructed of questions for which the criterion state was assumed to vary independently (American Polygraph Association, 2011; Nelson and Handler, 2010; Nelson et al., 2011).

Although the SSR involves the use of individual subtotal scores, previous research (Barland, Honts & Barger, 1989; Podlesny & Truslow, 1993; Raskin, Honts & Kircher, 2014; Raskin, Kircher, Honts & Horowitz, 1988) has shown that although comparison question polygraph tests are effective at differentiating individuals who are truthful or deceptive, these tests are not as effective at determining the exact question or questions, within a series, to which an individual has lied or told the truth. The reasons for this may have to do with both the psychological and attentional demands of multiple independent stimulus targets, and the mathematical and statistical complexities that result from aggregating the sensitivity, specificity, false-positive and false-negative rates of multiple independent results. Test questions of multiple-issue exams also have a shared, non-independent source of response variance in the form of the examinee. For these reasons, the final classification of examination results as belonging to the groups of deceptive or truthful persons is always determined at the level of the test as a whole.

When using the SSR, the test result is classified as deceptive if any independent question produces a result that is statistically significant for deception, while truthful classifications require that the results of all independent questions are statistically significant for truth-telling. Field practices (American Polygraph Association, 2009a; Department of Defense, 2006a; 2006b) do not support the interpretation of responses to some questions as truthful and other responses as deceptive within a single examination. Of course, statistical methods involving regression, variance and covariance may provide capabilities not available within the simple procedural rubric of the SSR.

Subtotal cutscores for truth-telling should ideally be determined using


Nelson

procedures to statistically correct for the potential reduction of test specificity when requiring multiple statistically significant truthful scores before a truthful classification is made. A common solution for this correction in the statistical and mathematical sciences, as has been described in procedural methods for polygraph scoring (Nelson and Handler, 2010; Nelson et al., 2011), is the Šidák correction, used when requiring multiple statistically significant independent probability events (Abdi, 2007). Some procedures involve the use of subtotal scores with traditional cutscores that are not derived from statistical reference distributions but are instead based on classification performance curves or heuristic experience (Bell et al., 1999; Department of Defense, 2006a; 2006b).

Physiological basis for the polygraph

Although a thorough and detailed description of the physiological responses recorded by the polygraph is beyond the scope of this paper, a practical description of polygraph physiology is inextricably linked with the need to translate changes in recorded data into the form of test scores and test results. In contrast to earlier models for test data analysis that relied somewhat heavily on pattern recognition as a means of monitoring and observing physiological activity (Department of Defense, 2004), evidence-based models in use today will employ only those physiological features that are amenable to measurement and that have been shown, through published and replicated peer-reviewed scientific studies, to be correlated at statistically significant levels with differences in response to different types of test stimuli that occur as a function of deception or truth-telling regarding past behavior (Bell et al., 1999; Harris, Horner & McQuarrie, 2000; Kircher, Kristjansson, Gardner & Webb, 2005; Kircher & Raskin, 1988; Podlesny & Truslow, 1993; Raskin, Kircher, Honts & Horowitz, 1988). Polygraph recording instrumentation has tended to focus on the acquisition of physiological response data that is of practical use to the task of scoring and interpreting polygraph test results, with few capabilities beyond that objective. For example, polygraph instrumentation is not used to evaluate cardiovascular or respiratory health.

Polygraph instrumentation consists minimally of three component sensors: two pneumograph sensors (thoracic and abdominal) to record breathing movement activity, electrical sensors to record autonomic activity in the palmar or distal regions (Handler, Nelson, Krapohl & Honts, 2010), and cardiovascular sensors to record relative changes in blood pressure (American Polygraph Association, 2009a, 2009b; Department of Defense, 2006a). Vasomotor sensors (Kircher & Raskin, 1988; Bell et al., 1999) are regarded as optional components of the polygraph instrument. Field testing protocols since 2007 have recommended the use of activity sensors to aid in the detection of countermeasure activity, and these sensors have been required by the American Polygraph Association (2011a) since January 1, 2012. This core combination of required sensors has been studied for several decades, and has been empirically shown to produce numerical scores that are structurally correlated with the criterion states of deception and truth-telling in statistical reference distributions from development and validation samples used in both field and laboratory studies (American Polygraph Association, 2011; Bell, Raskin, Honts & Kircher, 1999; Harris & Olsen, 1994; Harris, Horner & McQuarrie, 2000; Horowitz, Kircher, Honts & Raskin, 1997; Kircher & Raskin, 1988; Kircher, Kristjansson, Gardner & Webb, 2005; MacLaren & Krapohl, 2003; Offe & Offe, 2007; Olsen, Harris & Chiu, 1994; Raskin, Kircher, Honts & Horowitz, 1988).

The physiological mechanics of polygraph responses during comparison question tests occur in the context of the autonomic nervous system (ANS), which includes both sympathetic (S/ANS) and parasympathetic (PS/ANS) components (Bear, Barry, & Paradiso, 2007; Costanzo, 2007; Maton et al., 1993; Paradiso, Bear, & Connors, 2007; Silverthorn, 2009; Standring, 2005). The ANS regulates involuntary processes including cardiac rhythm, respiration, salivation, perspiration, and other forms of arousal. S/ANS activity is responsible for stimulation of the internal organs in response to activity demands.
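The decision rules and alpha corrections described earlier (GTR, TSR, SSR, and the Bonferroni and Šidák corrections) can be sketched in code. This is a minimal illustration only, not any published scoring algorithm: all cutscore values shown here are hypothetical placeholders, whereas field practice takes cutscores from published statistical reference tables.

```python
# Illustrative sketch of polygraph decision rules. Cutscore defaults are
# placeholder values, not values from published reference tables.

def bonferroni_alpha(alpha: float, k: int) -> float:
    """Corrected per-comparison alpha when any one of k comparisons
    may produce a (deceptive) classification."""
    return alpha / k

def sidak_alpha(alpha: float, k: int) -> float:
    """Corrected per-comparison alpha when all k independent comparisons
    must be simultaneously significant (e.g., truthful SSR results)."""
    return 1 - (1 - alpha) ** (1 / k)

def grand_total_rule(subtotals, dec_cut=-6, tru_cut=+4):
    """GTR: classify on the grand total (sum of RQ subtotals) alone."""
    grand = sum(subtotals)
    if grand <= dec_cut:
        return "Deception Indicated"
    if grand >= tru_cut:
        return "No Deception Indicated"
    return "Inconclusive"

def two_stage_rule(subtotals, dec_cut=-6, tru_cut=+4, sub_dec_cut=-7):
    """TSR: fall back to subtotal scores, compared against a statistically
    corrected (e.g., Bonferroni) cutscore, only when the grand total
    is inconclusive."""
    first = grand_total_rule(subtotals, dec_cut, tru_cut)
    if first != "Inconclusive":
        return first
    if any(s <= sub_dec_cut for s in subtotals):
        return "Deception Indicated"
    return "Inconclusive"

def subtotal_score_rule(subtotals, sub_dec_cut=-3, sub_tru_cut=+1):
    """SSR for multi-issue screening: deceptive if ANY subtotal is
    significant for deception; truthful only if ALL subtotals are
    significant for truth-telling."""
    if any(s <= sub_dec_cut for s in subtotals):
        return "Significant Reactions"
    if all(s >= sub_tru_cut for s in subtotals):
        return "No Significant Reactions"
    return "Inconclusive"
```

Note how the two corrections run in opposite directions: Bonferroni tightens the per-subtotal deceptive cutscore because any one of k subtotals can trigger a deceptive result, while Šidák relaxes the per-subtotal truthful requirement because all k subtotals must simultaneously reach significance.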



PS/ANS activity serves to reduce physiological activation to the minimum level necessary to ensure both longevity and adequate response to situational demands. PS/ANS and S/ANS activity are therefore in homeostatic balance with respect to real or perceived demands. The alternative to homeostatic balance is a general state of disease that may eventually lead to death. For this reason, every form of change in the ANS can be thought of as intended to maintain homeostasis and survival.

Polygraph examiners are primarily interested in recording and observing S/ANS activity, but it is important to understand that some activity, as with some cardiovascular data and respiration responses of interest to polygraph examiners, may actually be the result of changes in PS/ANS activity. (For more information about the dual innervation of the autonomic nervous system the reader is directed to more complete works by Janig (2006), Porges (2014), and Handler and Reicherter (2014).) The process of change, with the goal of maintaining homeostasis, is referred to as allostasis (Sterling & Eyer, 1988; Berntson & Cacioppo, 2007). Changes recorded during polygraph testing can be thought of as allostatic changes (Handler, Rovner & Nelson, 2008) that occur in an attempt to attain or maintain homeostasis.

Observable and recordable changes in physiological activity that are structurally correlated with deception and truth-telling during comparison question testing include the following three features: 1) subtle and temporary respiratory suppression (i.e., suppression or reduction of respiratory movement), 2) relative magnitude of phasic electrodermal activity indicative of increased S/ANS activity, and 3) relative magnitude of phasic response in the moving average of relative blood pressure. These measurable reactions have been described in several publications (ASTM International, 2002; Bell et al., 1999; Department of Defense, 2006a, 2006b; Harris, Horner & McQuarrie, 2000; Kircher & Raskin, 1988; Kircher et al., 2005; Krapohl & McManus, 1999; Raskin & Hare, 1978; Raskin et al., 1988). Physiological responses for these three primary sensors (respiratory suppression, electrodermal activity, and cardiovascular activity) are easily observed and recorded.

Common misconceptions about the polygraph include the notion that the polygraph measures deep or rapid breathing, sweaty palms or sweating activity, and rapid or increasing heart rate activity. Of these, the first two, increased respiratory activity and sweating activity, are known to be inaccurate and unsatisfactory explanatory models for polygraph reactions. Only the third, heart rate, has been included in validated statistical classifiers for deception and truth-telling. However, it is slowing of cardiac rhythm, not increase, which is correlated with deception (Kircher et al., 2005; Raskin & Hare, 1978). Pulse rate is included in the statistical model for the PolyScore algorithm (Blackwell, 1998; Dollins, Krapohl & Dutton, 1999; Dollins, Krapohl & Dutton, 2000; Harris & Olsen, 1994; Olsen et al., 1991; Olsen et al., 1994; Olsen, Harris, Capps & Ansley, 1997). Changes in heart rate are not included in other validated statistical models for scoring comparison question tests. Pulse rate activity is therefore rarely included in polygraph decisions in field settings.

Manual analysis of the relative or absolute magnitude of change or response in phasic activity is easily accomplished for electrodermal activity and cardiovascular activity. Electrodermal data has been shown to be a strong indicator of S/ANS arousal (Boucsein, 2012), and to be the most robust and reliable contributor to the final score and resulting classification of comparison question polygraph test results (Ansley & Krapohl, 2000; Harris & Olsen, 1994; Kircher, 1981; 1983; Kircher & Raskin, 2002; Kircher & Raskin, 1988; Kircher et al., 2005; Krapohl & McManus, 1999; Nelson, Krapohl & Handler, 2008; Olsen et al., 1997; Raskin et al., 1988).

Reactions to test stimuli can be evaluated through either non-parametric observation or through linear measurement. However, polygraph instrument manufacturers have not completely standardized the signal processing and feature extraction methods whereby data obtained during polygraph testing are to be anchored to linear changes



in physiological activity. As a result, measured responses to polygraph stimuli are used only in automated statistical classification models within the polygraph testing paradigm, and may not be directly related to measurements of similar physiological activity as utilized in medical fields. Most polygraph scoring paradigms will use a non-parametric feature extraction method.

Cardiovascular responses during comparison question testing have been shown to be correlated with the criterion categories of deception and truth-telling at statistically significant levels (Bell et al., 1999; Harris et al., 2000; Kircher & Raskin, 1988; Kircher et al., 2005; Nelson et al., 2008; Raskin et al., 1988). The structural correlation for cardiovascular response activity has been shown to be weaker than that of electrodermal response data, though stronger than that of respiratory response data. Diagnostic features and the interpretation of cardiovascular response activity were described by Handler and Reicherter (2008) and Handler, Geddes and Reicherter (2007). Some field polygraph examiners make use of photoelectric plethysmograph data, a form of cardiovascular recording for which information has been described by Handler and Krapohl (2007), Geddes (1974) and Honts, Handler, Shaw & Gougler (2015).

Of the three physiological sensors, respiratory data has been found to be the most susceptible to disruption from voluntary activity during polygraph testing. Respiration data has the weakest structural coefficients of the required polygraph sensors (Harris & Olsen, 1994; Harris et al., 2000; Kircher & Raskin, 1988; Kircher et al., 2005; Nelson, Krapohl & Handler, 2008; Olsen et al., 1997; Raskin et al., 1988). However, field examiners have learned to evaluate respiration data for indicators of cooperation or non-cooperation during testing, in addition to evaluating respiration data for indicators of deception and truth-telling. Some research (Kircher et al., 2005) has suggested that pneumograph data may be less diagnostic during comparison question tests conducted using DLC exams, while other findings have shown that respiration data of DLC exams does contain useable diagnostic information (Honts & Handler, 2014).

Respiratory suppression, though accurately measured by the curvilinear distance (Kircher & Raskin, 1988; Raskin et al., 1988; Timm, 1982) or the sum of absolute magnitude of change in y-axis excursion (Kircher & Raskin, 2002), is not easily measured without mechanical devices. Field polygraph examiners are taught to evaluate recorded data for the presence or absence of reaction patterns that have been described as correlated with the criterion categories of deception and truth-telling (Raskin & Hare, 1978; Bell et al., 1999; Harris et al., 2000; Kircher & Raskin, 1988; Kircher et al., 2005; Raskin et al., 1988). Pattern features shown to be correlated with respiratory suppression and CQT criterion categories are few, and include the following: 1) a subtle and temporary reduction of the tidal or inhalation volume, resulting in a reduction of the y-axis (vertical) magnitude of the respiratory tracings for multiple respiratory cycles following the onset of the test question stimulus, 2) a subtle and temporary slowing of respiratory rate for multiple respiratory cycles following the onset of the test question stimulus, and 3) a subtle and temporary elevation of the exhalation baseline or residual volume for multiple respiratory cycles following the stimulus onset. Apnea is also correlated with differences in deception and truth-telling (Bell et al., 1999; Kircher & Raskin, 1988), but can be easily feigned.

Polygraph sensors, while capable of recording sympathetic autonomic responses to test stimuli, are non-robust against disruptive somatic or physical activity that is sometimes not easily observed. In response to concerns about the potential for attempted faking during testing (i.e., countermeasures), somatic activity sensors have been developed to detect and record both overt and covert physical activity. There is indication in the literature that somatic activity sensors can increase examiners' ability to observe and detect these attempts (Ogilvie & Dutton, 2008; Stephenson & Barry, 1986). In the absence of recorded data of artifacted, odd or uninterpretable quality that indicates overt or covert physical activity, field examiners will assume that responses recorded by the respiration, electrodermal and cardiovascular



sensors have their origins in the ANS and are not altered or contaminated by covert somatic activity.

Psychological basis for the polygraph

A satisfactory psychological theory will parsimoniously and holistically account for the variety of known and observed phenomena associated with the polygraph test. Such a theory will explain electrodermal responses, cardiovascular responses, and respiratory responses, to both PLC and DLC questions, and will contribute to our understanding of test accuracy with both psychopathic and non-psychopathic persons. Moreover, a sound understanding of the psychological basis of the polygraph test will enable us to better understand issues of test suitability and unsuitability (i.e., for whom the test may or may not work). A comprehensive theoretical understanding will also extend to the psychological basis for responses to different testing paradigms such as the CQT and other polygraph and lie-detection paradigms – such as the concealed-information-test (CIT), which uses similar recorded physiological signals as a basis for ipsative calculations of the statistical significance of differences in responses to different test stimuli. Finally, a satisfactory psychological theory for polygraph testing will achieve a coherent integration of scientific knowledge regarding the polygraph with extant knowledge in related fields of science including cognitive, social and behavioral psychology, psychophysiology, signal detection theory, decision theory, statistical learning theory and more.

While a comprehensive discussion of the psychological basis for polygraph testing is beyond the scope of this paper, a brief explanation will hold that the psychological basis for responses to polygraph test stimuli involves a constellation of simple psychological mechanisms including cognition, emotion, and behavioral conditioning (Handler & Nelson, 2007; Handler, Shaw & Gougler, 2010; Kahn, Nelson, & Handler, 2009; Senter, Weatherman, Krapohl, & Horvath, 2010). Recently, preliminary process theory, related to orienting theory (Barry, 1996), has been suggested as a potentially parsimonious explanation for observed differences in response to different test stimuli (Palmatier & Rovner, 2014), though more discussion is needed to fully understand the advantages and limitations of this theory as applied to the polygraph. Until more detailed evidence is described, a general construct would suggest that all responses to test stimuli result from some combination of mental activity, emotion, and behavioral conditioning. All of these may play a role in physiological reactions that load differentially for different types of polygraph test stimuli (i.e., relevant and comparison questions) as a function of deception or truth-telling in response to relevant stimuli that describe a behavioral issue of concern. It will be important to refrain from attempting to define which single emotion, or the exact focus of attention and cognition within the examinee, until such time as evidence exists to verify a more detailed description.

Field examiners have tended historically to simplify the explanation of polygraph psychology to a minimum level that satisfies both themselves and their examinees. This was often done using a scientifically unsatisfactory explanation of "psychological set" (see Handler & Nelson, 2007) as related to the fight-or-flight response that has been attributed to Cannon (1929). Although now regarded as an inadequate model for both polygraph responses and stress responses in general (Bracha et al., 2004; Taylor et al., 2000), this now problematic hypothesis holds that examinees will focus their attention and physiological response on the question or issue that presents the greatest immediate threat to their survival and well-being. The most obvious evidence of the limitations of the "psychological set" hypothesis is that it cannot account for the effectiveness of DLCs, and does not adequately explain test effectiveness with psychopaths, who have been shown to have low levels of fear conditioning (Birbaumer et al., 2005). Additionally, the "psychological set" hypothesis requires the assumption that polygraph sensors can identify different types of emotions, though the literature does not support this notion (Kahn, Nelson, & Handler, 2009). Moreover, this explanation suffers from a fundamental vulnerability to



suggestions that it is pseudoscientific because it cannot satisfy a fundamental scientific requirement for falsifiability (Popper, 1959).

Handler & Nelson (2007) described the troublesome origins of the term "psychological set," which does not appear in the scientific psychological literature in the form employed by polygraph examiners. Differential salience has been suggested as a more general and parsimonious psychological theory that is more consistent with the field of scientific psychology, including emotion, cognition and conditioned learning as a basis of response to polygraph stimuli (Senter, Weatherman, Krapohl & Horvath, 2010).

Polygraph responses might also be accounted for using the conceptual framework of behavioral conditioning, as first described by Pavlov (1927), and learning theory, including the concepts of sensitization and habituation (Domjan, 2010; Groves & Thompson, 1970). A conditioned learning model for responses to polygraph stimuli suggests that involvement in a serious transgression amounts to a form of single-trial behavioral conditioning, with test questions functioning as a conditioned stimulus. Polygraph interviewing theory holds that a thorough and effective pretest interview will give the truthful examinee an opportunity to habituate to test questions, while causing a deceptive examinee to become sensitized to the test questions as a conditioned stimulus.

Cognitive-behavioral theory, which includes cognition, emotion and behavioral/experiential learning as a basis of physiological response, has also been suggested as an explanatory hypothesis for the variety of known polygraph phenomena (Kahn, Nelson & Handler, 2009), and this model is consistent with the salience hypothesis described by Handler and Nelson (2007) and Senter et al. (2010). A generalization of the cognitive-behavioral model for polygraph reactions suggests that truth-telling presents simpler cognitive and emotional task demands than deception.

A cognitive-behavioral and differential salience model would hold that physiological responses to a repeated sequence of polygraph test stimuli will be loaded onto different types of test stimuli as a function of deception and truth-telling regarding the investigation target issues, and that the basis of observed responses can be thought of as originating in cognition, memory, emotion and conditioned experience relative to the test stimuli. Relative differences in response to different types of test stimuli can be compared with statistical reference distributions and evaluated for their level of statistical significance to quantify the margin of uncertainty regarding a categorical conclusion of deception or truth-telling. This form of theoretical explanation is fundamentally testable and therefore fundamentally scientific (Popper, 1959).

Accuracy of polygraph examinations

Results from several decades of scientific study have consistently supported the validity of the hypothesis that the combination of instrumental recording and statistical modeling can discriminate deception and truth-telling at rates significantly greater than chance. Scientific reviews of peer-reviewed polygraph studies have borne this out repeatedly. Abrams (1989) surveyed the published literature and reported an accuracy level of .89. Honts and Peterson (1997), Raskin (2002), and Raskin & Podlesny (1979) reported the accuracy of polygraph studies as exceeding .90. The systematic review completed by the Office of Technology Assessment (1983) suggested that laboratory studies had an average unweighted accuracy of .83, with slightly higher accuracy, .85, from field studies at the time. Crewson (2001) reported an accuracy rate of .88 for diagnostic polygraphs in a comparison with medical and psychological tests. The National Research Council (2003) concluded with reservation that the polygraph differentiated deception from truth-telling at rates that were significantly greater than chance though less than perfect, and reported a median ROC of .89 for field studies and .86 for laboratory studies.

Different types of studies offer different advantages and disadvantages. Field studies offer assumed ecological validity, but are accompanied by a lack of experimental



control, and by inconsistent case confirmation and non-random case selection – making generalization of some field study results troublesome or impossible. Laboratory studies offer the potential for random sampling and sufficient experimental control to study questions of causality, but have unknown ecological validity. The general trend in psychological research has been a high level of correspondence between field and laboratory studies (Anderson, Lindsay & Bushman, 1999), and this fact underscores the need to avoid confusing ecological validity with external validity.

External validity and ecological validity are not synonymous. External validity – the ability to generalize results to field settings – is often achieved from scientific studies in laboratory settings with imperfect ecological validity. Previous studies by the Office of Technology Assessment (1983), the National Research Council (2003), the American Polygraph Association (2011), and Pollina et al. (2004) showed no statistically significant differences in the results of polygraph test accuracy in field and laboratory studies.

The most recent scientific review of comparison question polygraph techniques in present use (American Polygraph Association, 2011) reported a mean accuracy of .89 for event-specific diagnostic polygraphs, with some evidence-based methods having been shown to provide mean accuracy levels in excess of .90. Multi-issue polygraphs, of the types used in operational security, law enforcement pre-employment, and post-conviction screening programs, have been shown to have a mean accuracy rate of .85. More important than mean accuracy statistics are the 95% confidence ranges that surround those point estimates, described later in this report, especially the lower limit of test accuracy.

Earlier published scientific reviews (Abrams, 1973; Ansley, 1983; Ansley, 1990) have reported higher rates of test accuracy, often in the upper 90s. Results from these earlier systematic reviews are now thought to be confounded by sampling methodologies that may have overemphasized examiner self-report, possibly without researcher access to the recorded physiological data or numerical scores, and may have overemphasized the use of confession information as the case confirmation and selection criteria. These factors can potentially introduce non-random and non-representative case selection criteria that can systematically exclude both false-negative and false-positive error cases for which a confession is not likely to be obtained. Advocacy research involving proprietary polygraph techniques seems to have also resulted in the production of exaggerated accuracy estimates – often near perfect – sometimes involving the principal investigator as examiner, scorer and technique developer/proprietor. Individual studies reporting near-perfect accuracy have been described as seriously methodologically flawed (American Polygraph Association, 2011).

As with all forms of scientific testing, diagnostic tests conducted in response to a single issue of concern, for which there is evidence of a problem, will provide greater overall accuracy than multi-issue exams (American Polygraph Association, 2011; Crewson, 2001) that are intended to simultaneously test several issues for which the criterion variance is assumed to be independent (i.e., a person could lie to one or more investigation target questions while not lying to other investigation targets). Multiple-issue examinations involve more probabilistic and statistical decisions, and therefore a greater aggregated potential for error and uncertainty compared to single-issue exams. Other causes for differences in accuracy among diagnostic and screening exams may involve the competing attentional demands of multiple test target stimuli, and the potential that screening exam formats may at times be systematically biased for test sensitivity – with the goal of slightly over-predicting problems that can be resolved upon further investigation. It is also possible that some screening studies were completed using sub-optimal decision rules and cutscores that were not derived through scientific analysis.

The aggregated sensitivity rate for deception during diagnostic polygraphs was reported as .84 (95% confidence range .73 to .93) and the aggregated specificity rate for truth-telling during diagnostic polygraphs



was reported as .77 (95% confidence range .65 to .85) (American Polygraph Association, 2011). Evidence indicates that deceptive persons have a statistically significantly greater than chance probability of failing a polygraph test, while truthful persons have a statistically significantly greater than chance probability of passing a polygraph.

Aggregated error estimates for polygraph diagnostic tests were calculated in the most recent meta-analytic survey using 24 peer-reviewed scientific studies involving 8,975 confirmed scores from field and lab studies, and were reported as .08 for false negative errors (i.e., deceptive persons who pass the polygraph) and .12 for false positive errors (i.e., truthful persons who fail the polygraph). Inconclusive rates were reported as .09 for deceptive persons and .13 for truthful persons. The 95% confidence range for decision accuracy of event-specific diagnostic polygraphs was .83 to .95 (American Polygraph Association, 2011). Quite obviously, deviations from empirically validated testing protocols may decrease expected test accuracy, and, of course, these accuracy estimates assume that each examination is conducted on a suitable examinee.

The aggregated sensitivity rate for deception during multi-issue polygraphs, of the type employed in polygraph screening programs, was reported as .77 (95% confidence range .60 to .90) and the aggregated specificity rate for truth-telling during multi-issue criterion-independent polygraphs was reported as .72 (95% confidence range .63 to .81) (American Polygraph Association, 2011). Evidence indicates that deceptive persons have a statistically significantly greater than chance probability of failing a test constructed of multiple independent issues, while truthful persons have a statistically significantly greater than chance probability of passing a similarly constructed multi-issue polygraph.

Aggregated error estimates for polygraph tests constructed from test questions for which the criterion variance is assumed to be independent were calculated from 14 peer-reviewed studies involving 1,194 confirmed scores from field and lab studies, and were reported as .11 for false negative errors and .14 for false positive errors. Inconclusive rates were also reported as .11 for deceptive persons and .14 for truthful persons. The 95% confidence interval for the unweighted average accuracy rate for polygraph examinations constructed of test questions for which the criterion variance of the relevant questions was assumed to be independent was reported as .77 to .93 (American Polygraph Association, 2011).

For diagnostic polygraphs, the Positive Predictive Value (PPV), or probability that a failed polygraph result is correct, was reported as .89 (95% confidence range .81 to .99), while the Negative Predictive Value (NPV), or probability that a passed polygraph result is correct, was reported as .91 (95% confidence range .82 to .99) (American Polygraph Association, 2011). For screening polygraphs, PPV was reported as .83 (95% confidence range .71 to .94), while NPV was reported as .88 (95% confidence range .78 to .97).

Conservative judgment necessitates the selection of the lower end of the confidence limit as the boundary that we can be confident polygraph accuracy exceeds. Therefore, diagnostic polygraphs can be assumed to provide accuracy over .81 for deceptive results and over .82 for truthful results, while screening polygraphs can be assumed to provide accuracy over .71 for deceptive results and over .78 for truthful results. However, PPV and NPV are non-resistant to differences in base-rates, and the figures reported herein apply only to balanced groups of polygraph exams. Common inferential estimates of test accuracy (e.g., sensitivity, specificity, inconclusive and error rates) are resistant to differences in base-rates and can be more useful when interpreting the meaning of the result of a single examination, such as when a court is evaluating an individual case.
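The base-rate sensitivity of PPV and NPV described above follows directly from Bayes' theorem and can be illustrated with a short sketch. The sensitivity and specificity values are the APA (2011) diagnostic point estimates quoted above; because the published PPV and NPV were derived from the full meta-analytic decision tables, this simplified calculation will not reproduce those figures exactly, and the function names are illustrative only.

```python
def ppv(sensitivity, specificity, base_rate):
    """P(deceptive | failed test), via Bayes' theorem."""
    true_pos = sensitivity * base_rate
    false_pos = (1 - specificity) * (1 - base_rate)
    return true_pos / (true_pos + false_pos)

def npv(sensitivity, specificity, base_rate):
    """P(truthful | passed test), via Bayes' theorem."""
    true_neg = specificity * (1 - base_rate)
    false_neg = (1 - sensitivity) * base_rate
    return true_neg / (true_neg + false_neg)

# Sweep the base rate using the APA (2011) diagnostic point estimates
# quoted above (sensitivity .84, specificity .77).
for base_rate in (0.1, 0.5, 0.9):
    print(f"base rate {base_rate:.1f}: "
          f"PPV {ppv(0.84, 0.77, base_rate):.2f}, "
          f"NPV {npv(0.84, 0.77, base_rate):.2f}")
```

At a low base rate the PPV drops sharply while the NPV rises, which is the non-resistance to base-rates described above; sensitivity and specificity themselves do not change.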
Threats to polygraph accuracy

Because polygraph tests - like all tests - are inherently probabilistic (i.e., they are neither deterministic observation nor physical measurement), they are not perfect. No probabilistic test is completely immune to

44 Polygraph, 2015, 44(1)


Scientific Basis of Polygraph

potential error or threats to test accuracy.

Although the National Research Council (2003) could not find any scientific evidence that any personality type or endogenous factors significantly affect polygraph test accuracy, it is commonly understood that polygraph test accuracy may be compromised or reduced by the health, level of functioning, or suitability of the examinee.

Abrams (1975) showed that polygraph test accuracy was reduced significantly with the level of functional maturity of young juveniles. In other studies, Abrams and Weinstein (1974) showed that the polygraph cannot be expected to be accurate with subjects who have chronic mental health diagnoses within the psychotic spectrum of disorders, and further showed that polygraph accuracy is unstable for people whose intellectual abilities are below the lower limit of the normal range (Abrams, 1974).

Although developmental problems, low intellectual functioning, low functional maturity, and psychosis can adversely affect polygraph accuracy, there is no evidence that psychopathic personality issues will adversely affect polygraph test accuracy. Barland and Raskin (1975) studied criminal suspects with high psychopathic deviate scores on MMPI testing and showed no significant differences in the ability to detect deception. Patrick and Iacono (1989) also showed no significant differences in the detection of deception among psychopathic and non-psychopathic inmates. Raskin and Hare (1978) reported the same conclusion with a different sample of inmate subjects. Balloun and Holmes (1979) showed that polygraph accuracy using a guilty knowledge test paradigm was also not significantly different for college students with high and low psychopathic deviate scores on MMPI testing.

Although both the Office of Technology Assessment (1983) and the National Research Council (2003) expressed concern at the notion that polygraph test accuracy may be lower for persons with dangerous personality profiles, both reported that the published scientific evidence does not support, and consistently refutes, the hypothesis that psychopaths believe their lies and can therefore defeat the polygraph. In summary, polygraph testing with psychopathic persons can be assumed to be similar - as accurate and as inaccurate - as that with non-psychopathic persons. Regardless, public and media reactions may tend to simplistically assume that a person has “beaten” the polygraph whenever a testing error is observed. Some proportion of testing errors should not be surprising unless the proportion of errors can be shown to exceed the 95% confidence interval for normally expected error rates.

In response to concerns that examinations conducted under friendly circumstances, such as those conducted under attorney-client privilege, have less validity than those conducted by law enforcement examiners, Honts and Peterson (1997) described flawed logic and reliance on a false hypothesis as the basis of this concern, and summarized the findings reported by Honts (1997), who investigated the hypothesis through logic, case analysis, and meta-analysis. At the present time there is no basis of evidence to support, and available evidence contradicts, the notion that exams conducted under attorney-client privilege offer reduced accuracy or validity. These findings underscore the value of structured quantitative methods and the importance of objectivity and reproducibility when analyzing test data.

A final concern regarding polygraph test accuracy involves the possibility that countermeasures (i.e., faking) might be effective at altering the test outcome. This hypothesis represents an important concern for U.S. government and operational security programs, as well as for law enforcement pre-employment screening and post-conviction supervision screening tests with convicted offenders. Despite the importance of this concern, only a small number of studies have been published on the topic of faking and polygraph accuracy.

Rovner (1979; 1986) and Rovner, Raskin and Kircher (1979) showed that having access to information about the polygraph technique was insufficient to significantly alter test accuracy. A concerning finding in these studies was that truthful subjects who attempted to employ


countermeasures, in an attempt to increase their assurance of passing, actually increased their likelihood of being classified as deceptive. These findings led the National Research Council (2003) to conclude that countermeasure use by truthful examinees was not advisable.

Timm (1991) reported post-hypnotic suggestion to be ineffective as a polygraph countermeasure, while Ben-Shakhar and Dolev (1996), along with Elaad and Ben-Shakhar (1991), suggested that mental efforts may have an effect primarily on the electrodermal channel. Studies by Iacono, Boisvenu and Fleming (1984) and Iacono, Cerri, Patrick and Fleming (1992) showed benzodiazepines and stimulant medications to be ineffective countermeasures, though an earlier study by Waid, Orne and Orne (1981) indicated that meprobamate may adversely affect polygraph results. An inherent limitation to our acquisition of additional knowledge in this area is that obtaining ethics committee approval to fully explore the effects of drugs or psychiatric medication on polygraph test results may be difficult or unlikely.

Concerning results regarding polygraph countermeasures have been described by Honts (1987), Honts, Amato and Gordon (2004), Honts and Hodes (1983), Honts, Hodes and Raskin (1985), Honts, Raskin and Kircher (1987), and Honts, Raskin, Kircher and Hodes (1988), whose collective work began to suggest that human polygraph experts are not as effective as they claim at differentiating countermeasure use from other artifacts. Additionally, Honts et al. (1988) and Honts and Reavy (2009) found that spontaneous countermeasure use was not uncommon among both deceptive and truthful examinees. Raskin and Kircher (1990) reported that training in physical countermeasures can reduce polygraph test accuracy, and Honts, Raskin and Kircher (1994) reported mental and physical countermeasures as equally effective. In a different study, Honts, Winbush and Devitt (1994) reported that mental countermeasures can be used to defeat guilty knowledge tests. In another study, Honts and Amato (2001) again reported that countermeasure attempts by truthful subjects resulted in the production of more deceptive test scores. Somatic activity sensors, and testing procedures intended to elucidate both mental and physical countermeasures, became more widely used following these studies, and replication of these studies is needed using contemporary testing instrumentation and methodologies.

Activity sensors are designed to be sensitive to somatic/behavioral nervous system activity while remaining robust against recording the effects of ANS activity of interest to polygraph test data analytic models. The rationale for activity sensors involves the fact that polygraph component sensors, while intended to be sensitive to sympathetic autonomic nervous system activity, are non-robust against also recording the effects of somatic activity. Stephenson and Barry (1986), along with Ogilvie and Dutton (2008), showed that the addition of an activity sensor can increase the detection of somatic activity, and this may reduce the occurrence of false accusations of countermeasure use. It is assumed that observable activity indicates that the recorded polygraph data is likely to be an adulterated composite of both autonomic and somatic activity. Correspondingly, the absence of somatic activity would indicate that the recorded autonomic data is most likely unaltered and authentic.

Statistical methods have not yet been widely exploited in countermeasure detection. However, the OSS-3 algorithm (Nelson et al., 2008) includes a procedural requirement to review data for interpretable data quality and mark any segments of data that are artifacted by movement or other problem activity. The OSS-3 algorithm will aggregate the number and location of indicated artifact events and then calculate the statistical probability that the observed artifacts have occurred due to random causes. The algorithm will alert the examiner to the possibility that an examinee may have attempted to systematically or intentionally alter the recorded physiological data whenever the likelihood falls below an established alpha boundary for statistical significance. Statistical methods have not yet been exhaustively studied, and additional research is needed regarding the application of statistical and computational methods to


detect countermeasure attempts.
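Neither this article nor the OSS-3 citation above specifies the distributional model behind the artifact-significance calculation, so the following is only a hedged sketch of the general idea the preceding paragraph describes: count the marked artifact events and ask whether their concentration on one question type could plausibly be random. A one-sided binomial test is one simple way to frame that question; the counts, the 1/3 proportion, and the alpha value are all hypothetical.

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance that at least k of n
    marked artifact events would land on one question type at random."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical counts: 10 artifacts marked during quality review, 7 of them
# on comparison questions, which make up 3 of the 9 scored positions (p = 1/3).
p_value = binom_sf(7, 10, 1 / 3)
ALPHA = 0.05  # illustrative alpha boundary for statistical significance
print(f"p = {p_value:.3f}; flag for possible systematic activity: {p_value < ALPHA}")
```

A p-value below the alpha boundary would correspond to the alert described above: the clustering of artifacts is unlikely to have occurred by chance alone.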
Strategies have been described involving both covert physical/muscle activity and mental activities. An alternative, social, countermeasure strategy would be an attempt to convince the examiner to ignore recorded indicators of deception as the result of some alternative cause.

Other forms of potential countermeasures may involve the use of medications or drugs, sleep deprivation or physical exhaustion, and the use of mental efforts such as hypnosis, meditation or mental activity. In general, polygraph countermeasures can be expected to attempt to either dampen or exaggerate responses to the entire set of test questions, resulting in an increased likelihood of an inconclusive test result. Alternatively, examinees may attempt to strategically alter responses to relevant or comparison test stimuli. Because it appears unwise to exaggerate responses to relevant stimuli, and because the reliable suppression of responses to only some test questions will present non-trivial complexities to the examinee, polygraph countermeasure strategies will most likely involve attempts to either dampen or disrupt the test as a whole, or to augment or increase responses to comparison question stimuli.

The goal of countermeasures or faking attempts, in terms of test data analysis, is to substantially alter the diagnostic and error variance contained in the recorded data such that it is uninterpretable, or such that there is no correlation between the deceptive or truthful criterion state and reaction differences that occur in response to different types of test stimulus questions. This would produce a test result that is inconclusive because the data are either uninterpretable or not statistically significant.

A sophisticated countermeasure objective would propose to alter the recorded test data such that the mathematical, statistical and computational methods to partition and compare the sources of variance would result in a direct reversal of the valence of the correlation coefficients for the criterion states of deception and truth-telling. Successful countermeasure attempts of this type would also require that the adulteration of both diagnostic and error variance in the recorded physiological data is accomplished in a manner that will convince trained examiners of the authentic quality of the recorded data. The inherent difficulty of this challenge is compounded by the fact that polygraph testing, as with other forms of testing, can include mechanisms and analytical procedures designed to quantify the probability that a person has attempted to engage in countermeasures.

In response to the complexity of the issues, assertions and conclusions surrounding the potential for countermeasures during polygraph testing, the National Research Council (2003) wrote the following:

“Because the effective application of mental or physical countermeasures on the part of examinees would require skill in distinguishing between relevant and comparison questions, skill in regulating physiological response, and skill in concealing countermeasures from trained examiners, claims that it is easy to train examinees to “beat” both the polygraph and trained examiners require scientific supporting evidence to be credible. However, we are not aware of any such research.” (p. 147).

The literature review of the National Research Council (2003) was unable to produce evidence supporting the hypothesis that polygraph countermeasures and faking attempts are effective at assisting deceptive persons to defeat comparison question polygraphs as they are presently used in field settings. However, this should not be interpreted as support for the infallibility of the polygraph, nor as an assertion that nobody has ever passed a polygraph in error.

Although all test paradigms may have some potential vulnerability to exploitation, data at this time suggest that systematic efforts to alter polygraph results are not well supported by evidence. Neither are claims that the polygraph test is infallible. Polygraph test results remain a probabilistic


estimate of the degree of uncertainty surrounding a categorical conclusion. What is known at this time is that virtually all guilty or deceptive persons who agree to undergo polygraph testing may attempt to engage in some form of activity in an attempt to achieve a negative test result.

Polygraph has been shown to be an effective, even if imperfect, tool for discriminating deception and truth-telling, even in field studies involving persons who were suspected of actual, sometimes serious, crimes - some of whom can be assumed to have attempted to engage in some form of countermeasure in an attempt to pass the test while lying. However, even a remote potential for effective countermeasure use - and the relationship of this to polygraph accuracy in government operational security, law enforcement pre-employment, and post-conviction supervision of convicted offenders in community settings - will mean that the interplay between countermeasure attempts and the authenticity of recorded test data will remain an important area of concern. Regardless of the effectiveness or ineffectiveness of countermeasures and faking attempts, additional and continuous research is needed to more fully understand the vulnerability of contemporary polygraph testing procedures to countermeasures.

Contribution of polygraph results to professional decisions

In an attempt to quantify the value and contribution of polygraph results to professional judgment, Honts and Schweinle (2009) described the application of the Information Gain Index (IGI; Wells & Olson, 2002) to diagnostic and screening polygraphs. IGI is a measurement of the increase in decision accuracy that results from a method, and offers the advantage of providing information across the spectrum of prior base-rates - something other statistical metrics have failed to do efficiently. This method compares the increase in decision accuracy provided by polygraph results with unassisted professional judgment - for which Vrij (2008) showed that police officers achieve a decision accuracy of 56% when attempting to determine deception or truth telling. The importance of the IGI statistic is that it provides information about the increase in decision accuracy across the entire range of possible base-rates.

Honts and Schweinle (2009) showed that diagnostic polygraphs significantly increase decision accuracy for both deceptive and truthful examinees, peaking at 27 times the accuracy of unassisted decisions under assumed high base-rate conditions, with statistically significant increases in decision accuracy, compared to unassisted expert judgment, from base-rates .01 to .97 for deceptive outcomes and .03 to .99 for truthful outcomes. Screening polygraphs, often conducted under assumed low base-rate conditions, showed a statistically significant increase in the accuracy of decisions regarding deception, but no significant increase in the accuracy of decisions regarding truth telling. The increase in decision accuracy for deceptive outcomes was statistically significant, compared to unassisted lie-detection, from base-rates .02 to .83, and was statistically significant for truthful outcomes from base-rates .11 to .99.

Handler, Honts and Nelson (2013) further evaluated the IGI statistic using polygraph screening tests of the type used in operational security, law-enforcement screening, and post-conviction sex offender supervision programs. Handler et al. showed statistically significant increases in the accuracy of screening decisions, compared to unassisted lie detection, from base-rates .01 to .94 for deceptive outcomes and from base-rates .07 to .99 for truthful outcomes. This suggests that effective use of polygraph testing, though probabilistic and imperfect, has the potential to increase the effectiveness of professional decision making.
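The comparison that the IGI formalizes can be sketched in simplified form. The figures below use the APA (2011) diagnostic point estimates quoted earlier and Vrij's (2008) 56% unaided accuracy; the overall-accuracy formula is an illustrative simplification that ignores inconclusive results, not Wells and Olson's actual index.

```python
def decision_accuracy(sensitivity, specificity, base_rate):
    """Probability of a correct categorical decision at a given base rate,
    ignoring inconclusive results."""
    return sensitivity * base_rate + specificity * (1 - base_rate)

UNAIDED = 0.56  # Vrij (2008): decision accuracy of unassisted police officers

# Sweep base rates with the APA (2011) diagnostic point estimates quoted earlier.
for base_rate in (0.1, 0.3, 0.5, 0.7, 0.9):
    aided = decision_accuracy(0.84, 0.77, base_rate)
    print(f"base rate {base_rate:.1f}: polygraph {aided:.2f} vs unaided {UNAIDED:.2f}")
```

Even in this simplified form, the polygraph-assisted accuracy exceeds the unaided rate across the entire sweep of base rates, which is the pattern the IGI studies above quantify more formally.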
Conclusions

Evidence exists to support the scientific validity of polygraph testing in both diagnostic and screening contexts, and there is sufficient evidence to warrant continued interest in both research and practice of the instrumental and statistical discrimination of deception and truth telling in both forensic and screening programs. Available evidence can describe and account for virtually all aspects of the polygraph test, including


general test theory, testing procedures, decision theory, and signal detection theory, test accuracy, and vulnerability to countermeasures or faking, along with the psychological and physiological basis of responses to polygraph stimuli.

The polygraph test, like other scientific tests, is a probabilistic test that involves the recording of physiological responses to stimuli and uses statistical decision theory to quantify the margin of error or level of statistical significance - or alternatively the odds or confidence level - associated with the test result (Nelson, 2014a; 2014b; 2014c; 2014d; 2014e). The polygraph test achieves its objectives through the structural combination of physiological responses that have been shown to be reliable proxies, correlated at statistically significant levels with differences in responses that are loaded onto different types of test stimuli (i.e., RQs and CQs) as a function of deception and truth-telling regarding a past behavior.

The need for a test that can discriminate deception and truth-telling arises from the fact that evidence may not exist, or may not yet have been uncovered, to enable a deterministic conclusion or physical measurement. A probabilistic test of deception - with accuracy significantly greater than both chance and unassisted lie detection - is the scientific alternative to the near-chance accuracy rates of unaided human inference. Scientific tests are needed whenever a perfect deterministic observation is not possible.

Tests are often needed to make informed conclusions about events in the past, which can no longer be observed directly or deterministically, and also to understand the potential for future events which have not yet occurred and which therefore cannot yet be directly or deterministically observed. Polygraph test results refer to both the likelihood that a past behavior has occurred, and to the future potential that information or evidence will be uncovered to confirm a conclusion.

Scientific tests are also needed when there is a desire to measure amorphous phenomena that cannot be subjected to physical measurement. Tests are not needed when direct mechanical or linear measurement is possible, in which case we simply measure the item of interest. Measurement, as compared to testing, involves mechanical measurement error. Testing of any type may involve human sources of variance and other random or uncontrolled sources of error variance in addition to the diagnostic variance contained within and expressed by the testing data.

All test data, including polygraph test data, are a combination of diagnostic variance (also referred to as explained variance, controlled variance, or signal) and error variance (also referred to as random variance, unexplained variance, uncontrolled variance, or noise). Ideally, testing data will include a large portion of diagnostic variance and a small portion of error variance, but no test is perfect. Scientific tests are expected only to quantify and account for the margin of error surrounding a test result, and to account for the basis of assumptions related to the testing procedures.

Because scientific tests are used to evaluate amorphous phenomena that cannot be subjected to deterministic observation or physical measurement, all test results are probabilistic - including when they are simplified to categorical test results. Quantification of the margin of error or level of statistical significance associated with a test result will enable referring professionals and consumers of testing results to make better-informed conclusions about the meaning and usefulness of a test result.

Concerns about the ethics of polygraph testing, and especially polygraph screening programs, have sometimes pointed to the lack of perfection and the false positive error rate as the basis for argument against the use of the polygraph. Expectations for deterministic perfection - in which test results are not affected by uncontrolled variance, human behavior, or random error - are not realistic, and frustration or disappointment regarding a lack of deterministic perfection is not warranted in a scientific testing context.
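The expectation that a scientific test quantify and account for the margin of error surrounding a test result can be made concrete with a standard confidence interval for an observed accuracy rate. The counts below are hypothetical, and the Wilson score interval is simply one common choice for a proportion, not a method this article prescribes.

```python
from math import sqrt

def wilson_interval(correct, n, z=1.96):
    """Approximate 95% Wilson score interval for an observed accuracy rate."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical counts: 860 correct decisions among 1,000 confirmed cases.
low, high = wilson_interval(860, 1000)
print(f"observed accuracy .860, 95% CI {low:.3f} to {high:.3f}")
```

Reporting the interval rather than the point estimate alone is what permits the conservative lower-limit reasoning applied to the meta-analytic accuracy figures earlier in this report.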


It is important to remember that one of the operational goals of testing - in medicine, psychology, forensics, polygraph programs and other testing contexts - is the reduction of harm resulting from both false-positive and false-negative errors. Any practical testing method that achieves accuracy significantly greater than chance has the potential for reducing such harm if used effectively. It is also important to note that screening tests of some types may be intended to slightly over-predict the presence of problems - with the goal of correcting errors with subsequent diagnostic testing. While false-positive errors can be identified and corrected with additional testing and investigation, identification of false-negative errors is sometimes not possible until a problem has escalated to a degree that can sometimes permanently affect individual lives and futures. It is equally important to remember that neither polygraph test results, nor any form of test result, should be used alone as the basis for decisions that affect the rights and liberties of individuals. There are no published policies or standards of practice for screening polygraphs that advise or require the use of polygraph test results alone as a sufficient basis for professional decision-making.

As always, more information and research is desirable in some areas, pertaining to both theoretical constructs and practical concerns. Theories are satisfactory only as long as they account for known and observed phenomena. The emergence of any evidence or phenomena that is not accounted for by our current theories should be taken as an indicator of the need and obligation to continue to revise our knowledge and assumptions in response to the new information. Failure to make revisions to working theories is an indicator of stasis, and is a characteristic of pseudoscientific endeavors. Scientists are continuously upgrading all working theories in response to an ever-increasing volume of known and observed phenomena. All theories in the realm of science are expected to evolve over time and to move towards an integrated framework with other theories from other fields of science. For this reason, it will be important for the polygraph profession to continue to make use of new information from related fields of science.

Practical areas for which more information is needed include the accuracy of both diagnostic and screening polygraphs, the contribution of the results of diagnostic polygraphs to investigation outcomes, and the contribution of screening polygraphs to case and program outcome measures such as rule-violations, corruption, dereliction and recidivism. As long as some persons are motivated to engage in deception, and as long as others attempt to use scientific technologies to detect deception, there will be some who are interested in developing countermeasure strategies to evade detection. For this reason, there will be continued and ongoing interest in developing knowledge regarding the potential vulnerabilities of the polygraph test, and what additional methods can be applied to counter those vulnerabilities.

Though it is colloquially referred to as a “lie detector” test as a term of convenience, science and scientific reason do not suppose that the polygraph actually measures lies per se. All test results are probability statements. The alternative to a probabilistic understanding of polygraph test results is to encourage false expectations and frustration that the lies themselves can somehow be subjected to deterministic observation or physical measurement. In reality, the principles of physiology and psychology are sufficiently complex and variable that a probabilistic model is necessary and unavoidable.

Because lies per se are amorphous temporal events, lie detection will likely remain a probabilistic and imperfect task. It will be important to remain aware that the goals of scientific testing are often to quantify and measure phenomena that cannot be subjected to deterministic observation or mechanical measurement. Polygraph test results are a measurement of the uncertainty surrounding a categorical conclusion based on differences in responses that are loaded onto different types of test stimuli as a function of deception or truth-telling regarding a behavioral concern.


References

Abdi, H. (2007). Bonferroni and Šidák corrections for multiple comparisons. In N.J. Salkind (Ed.),
Encyclopedia of Measurement and Statistics. Sage.

Abrams, S. (1973). Polygraph validity and reliability: A review. Journal of Forensic Sciences, 18,
313-326.

Abrams, S. (1974). The validity of the polygraph with schizophrenics. Polygraph, 3, 328-337.

Abrams, S. (1975). A response to Lykken on the polygraph. American Psychologist, 30, 709-711.

Abrams, S. (1977). A polygraph handbook for attorneys. Lexington, MA: Lexington Books.

Abrams, S. (1989). The complete polygraph handbook. Lexington, MA: Lexington Books.

Abrams, S. (1984). The question of the intent question. Polygraph, 13, 326-332.

Abrams, S. & Weinstein, E. (1974). The validity of the polygraph with retardates. Journal of Police Science and Administration, 2, 11-14.

American Polygraph Association (2009a). Model Policy for Post-conviction Sex Offender Testing. [Electronic version] Retrieved January 25, 2012, from http://www.polygraph.org.

American Polygraph Association (2009b). Model Policy for Law Enforcement Public Services Pre-Employment Screening Examinations. [Electronic version] Retrieved January 25, 2012, from http://www.polygraph.org.

American Polygraph Association (2011a). By-laws: American Polygraph Association, effective 1-1-2012. [Electronic version] Retrieved January 6, 2012, from http://www.polygraph.org.

American Polygraph Association (2011b). Meta-analytic survey of criterion accuracy of validated polygraph techniques. Polygraph, 40(4), 196-305.

Amsel, T. T. (1999). Exclusive or nonexclusive comparison questions: A comparative field study. Polygraph, 28, 273-283.

Anderson, C. A., Lindsay, J. J., & Bushman, B. J. (1999). Research in the psychological laboratory: Truth or triviality? Current Directions in Psychological Science, 8, 3-9.

Ansley, N. (1983). A compendium on polygraph validity. Polygraph, 12, 53-61.

Ansley, N. (1990). Law notes: Civil and criminal cases. Polygraph, 19, 72-102.

Ansley, N. (1999, July). The frequency of appearance of evaluative criteria in polygraph charts.
Defense Personnel Research Center.

Ansley, N. & Krapohl, D.J. (2000). The frequency of appearance of evaluative criteria in field
polygraph charts. Polygraph, 29, 169-176.

ASTM (2002). Standard Practices for Interpretation of Psychophysiological Detection of Deception (Polygraph) Data (E 2229-02). ASTM International.

Backster, C. (1963). Standardized polygraph notepack and technique guide: Backster zone comparison technique. New York: Cleve Backster.


Balloun, K. D. & Holmes, D.S. (1979). Effects of repeated examinations on the ability to detect guilt
with a polygraphic examination: A laboratory experiment with a real crime. Journal of
Applied Psychology, 64, 316-322.

Barland, G. H. (1981). A Validity and Reliability Study of Counterintelligence Screening Test. Fort
George G. Meade, Maryland: Security Support Battalion, 902d Military Intelligence Group.
[Reprinted in Polygraph 41 (1) 1-27].

Barland, G. H., Honts, C. R. & Barger, S.D. (1989). Studies of the accuracy of security screening
polygraph examinations. DTIC AD Number A304654. Department of Defense Polygraph
Institute.

Barland, G. H. & Raskin, D.C. (1975a). An evaluation of field techniques in detection of deception.
Psychophysiology, 12 (3), 321-330.

Barland, G. H. & Raskin, D.C. (1975b). Psychopathy and detection of deception in criminal
suspects. Psychophysiology, 12, 224.

Barry, R. J. (1996). Preliminary process theory: Towards an integrated account of the
psychophysiology of cognitive processes. Acta Neurobiologiae Experimentalis, 56(1), 469-484.

Bear, M. F., Connors, B. W. & Paradiso, M.A. (2007). Neuroscience: Exploring the brain: Third
Edition. Philadelphia, PA: Lippincott Williams & Wilkins.

Bell, B. G., Raskin, D. C., Honts, C. R. & Kircher, J.C. (1999). The Utah numerical scoring system.
Polygraph, 28 (1), 1-9.

Ben-Shakhar, G. & Dolev, K. (1996). Psychophysiological detection through the guilty knowledge
technique: Effects of mental countermeasures. Journal of Applied Psychology, 81, 273-281.

Berntson, G. G., & Cacioppo, J. T. (2007). Integrative physiology: Homeostasis, allostasis and the
orchestration of systemic physiology, in Cacioppo, J. T., Tassinary, L. G., and Berntson, G.
G. (Eds.) Handbook of Psychophysiology, 3rd edition (pp. 433-449). New York: Cambridge
University Press.

Birbaumer, N., Veit, R., Lotze, M., Erb, M., Hermann, C., Grodd, W. & Flor, H. (2005). Deficient
fear conditioning in psychopathy: a functional magnetic resonance imaging study. Archives
of general psychiatry, 62, 799-805.

Blackwell, J. N. (1998). PolyScore 3.3 and psychophysiological detection of deception examiner rates
of accuracy when scoring examinations from actual criminal investigations. Available at the
Defense Technical Information Center. DTIC AD Number A355504/PAA. [Reprinted in
Polygraph, 28 (2), 149-175].

Blalock, B., Nelson, R., Handler, M. & Shaw, P. (2011). A position paper on the use of directed lie
comparison questions in diagnostic and screening polygraphs. Police Polygraph Digest,
(2011), 2-5.

Blalock, B., Nelson, R., Handler, M. & Shaw, P. (2012). The empirical basis for the use of directed
lie comparison questions in diagnostic and screening polygraphs. APA Magazine, 45(1), 36-
39.

Bracha, H. S., Ralston, T. C., Matsukawa, J. M., Williams, A. E. & Bracha, A.S. (2004). Does
"fight or flight" need updating? Psychosomatics, 45 (5), 448-449.

Scientific Basis of Polygraph

Boucsein, W. (2012). Electrodermal activity. New York: Plenum Press.

Bradley, M. T. & Janisse, M.P. (1981). Accuracy demonstrations, threat, and the detection of
deception: Cardiovascular, electrodermal, and pupillary measures. Psychophysiology, 18,
307-315.

Campion, M., Campion, J. & Hudson, J. (1994). Structured interviewing: A note on incremental
validity and alternative question types. Journal of Applied Psychology, 79, 998-1002.

Cannon, W. B. (1929). Bodily changes in pain, hunger, fear, and rage. New York: Appleton-Century-
Croft

Capps, M. H. (1991). Predictive value of the sacrifice relevant. Polygraph, 20, 1-6.

Costanzo, L. (2007). Physiology. Hagerstown, MD: Lippincott Williams & Wilkins.

Crewson, P. E. (2001). A comparative analysis of polygraph with other screening and diagnostic
tools. Research Support Service. Report No. DoDPI01-R-0003. [Reprinted in Polygraph, 32,
57-85].

Department of Defense (2004a). Federal psychophysiological detection of deception examiner
handbook. [Retrieved from http://www.antipolygraph.org on 2-23-2008].

Department of Defense (2004b). Test data analysis: DoDPI numerical evaluation scoring system.
[Retrieved from http://www.antipolygraph.org on 6-28-2007].

Department of Defense (2006a). Federal psychophysiological detection of deception examiner
handbook. Reprinted in Polygraph, 40 (1), 2-66.

Department of Defense (2006b). Test data analysis: DoDPI numerical evaluation scoring system.
[Retrieved from http://www.antipolygraph.org on 3-31-2007].

Department of Defense Polygraph Institute (2002). Department of Defense Polygraph Institute Law
Enforcement Pre-employment Test. [Retrieved from http://www.antipolygraph.org on 6-2-
2011].

Dollins, A. B., Krapohl, D. J. & Dutton, D. (1999). A comparison of computer programs designed to
evaluate psychophysiological detection of deception examinations: Bakeoff 1. Department of
Defense Polygraph Institute. Reprinted in Polygraph 29 (237-247).

Domjan, M. (2010). Principles of learning and behavior, 6th edition. Cengage/Wadsworth.

Drever, E. (1995). Using Semi-Structured Interviews in Small-Scale Research. A Teacher's Guide.
Scottish Council for Research in Education, Edinburgh.

Elaad, E. & Ben-Shakhar, G. (1991). Effects of mental countermeasures on psychophysiological
detection in the guilty knowledge test. International Journal of Psychophysiology, 11, 99-108.

Furedy, J. J. (1989). The North American CQT polygraph and the legal profession: a case of
Canadian credulity and a cause for cultural concern. The Criminal Law Quarterly, 31, 431-
451.

General Accounting Office (1991). Using structured interviewing techniques. Program Evaluation
and Methodology Division [Retrieved online http://www.gao.gov/policy/10_1_5.pdf on 10-
20-2009].

Green, D. M. & Swets. J A (1966). Signal detection theory and psychophysics. New York: Wiley.

Greenberg, I. (1982). The role of deception in decision theory. Journal of Conflict Resolution, 26,
139-156.

Groves, P. M. & Thompson, R.F. (1970). Habituation: A dual-process theory. Psychological Review,
77, 419-450.

Handler, M. (2006). The Utah PLC. Polygraph, 35, 139-148.

Handler, M., Geddes, L., Reicherter, J. (2007). A discussion of two diagnostic features of the
cardiovascular channel. Polygraph, 36 (2), 70-83.

Handler, M., Honts, C., Krapohl, D., Nelson, R. & Griffen, S. (2009). Integration of pre-employment
polygraph screening into the police selection process. Journal of Police and Criminal
Psychology, 24, 69-86.

Handler, M. & Krapohl, D. (2007). The use and benefits of the photoelectric plethysmograph in
polygraph testing. Polygraph, 36, 18-25.

Handler, M. & Nelson, R. (2007). Polygraph terms for the 21st Century. Polygraph, 36, 157-164.

Handler, M. & Nelson, R. (2008). Utah approach to comparison question polygraph testing.
European Polygraph, 2, 83-110.

Handler, M. Honts, C. & Nelson, R. (2013). Information gain of the Directed Lie Screening Test.
Polygraph, 42, 192-202.

Handler, M., Nelson, R. & Blalock, B. (2008). A focused polygraph technique for PCSOT and
law enforcement screening programs. Polygraph, 37 (2), 100-111.

Handler, M., Nelson, R., Krapohl, J. & Honts, C. (2010). An EDA Primer for Polygraph Examiners.
Polygraph, 39, 68-108.

Handler, M., & Reicherter, J. (2008). Respiratory blood pressure fluctuations observed during
polygraph examinations. Polygraph, 37 (4), 256-262.

Handler, M., Rovner, L. & Nelson, R. (2008). The concept of allostasis in polygraph testing.
Polygraph, 38, 228-233.

Handler, M., Shaw, P. & Gougler, M. (2010). Some thoughts about feelings: A study of the role of
cognition and emotion in polygraph testing. Polygraph, 39, 139-154.

Harris, J. C. & Olsen, D.E. (1994). Polygraph Automated Scoring System. Patent Number:
5,327,899. U.S. Patent and Trademark Office.

Harris, J., Horner, A. & McQuarrie, D. (2000). An evaluation of the criteria taught by the department
of defense polygraph institute for interpreting polygraph examinations. Johns Hopkins
University, Applied Physics Laboratory. SSD-POR-POR-00-7272.

Hastie, T., Tibshirani, R. & Friedman, J. (2001). Elements of statistical learning. Springer.

Hilliard, D. L. (1979). A cross analysis between relevant questions and a generalized intent to
answer truthfully question. Polygraph, 8, 73-77.

Hodgson, P. (1987). A practical guide to successful interviewing. Maidenhead: McGraw-Hill.

Honts, C. R. (1987). Interpreting research on polygraph countermeasures. Journal of Police Science
and Administration, 15, 204-209.

Honts, C. R. (1997). Is it time to reject the friendly polygraph examiner hypothesis (FPEH)? Paper
presented at the American Psychological Society at the 9th annual meeting, Washington, D.
C., May 23-26, 1997.

Honts, C. R. & Alloway, W.R. (2007). Information does not affect the validity of a comparison
question test. Legal and Criminological Psychology, 12, 311-320.

Honts, C. R. & Amato, S. (2001). Psychophysiological credibility assessment. Journal of Forensic
Psychology Practice, 1, 87-99.

Honts, C., Amato, S. & Gordon, A. (2004). Effects of outside issues on the comparison question
test. Journal of General Psychology, 131 (1), 53-74.

Honts, C. R. & Devitt, M.K. (1992). Bootstrap decision making for polygraph examinations.
Department of Defense Polygraph Institute report No DoDPI92-R-0002.

Honts, C. & Handler, M. (2014). Scoring respiration when using directed lie comparison questions.
Polygraph, 43 (3) 71-78.

Honts, C. Handler, M. Shaw, P. & Gougler, M. (2015). The vasomotor response in the comparison
question test. Polygraph 44 (1) 62-78.

Honts, C. R. & Hodes, R.L. (1983). The detection of physical countermeasures. Polygraph, 12, 7-
17.

Honts, C. R., Hodes, R. L. & Raskin, D.C. (1985). Effects of physical countermeasures on the
physiological detection of deception. Journal of Applied Psychology, 70, 177-187.

Honts, C. R. & Peterson, C.F. (1997). Brief of the Committee of Concerned Social Scientists as
Amicus Curiae United States v Scheffer. Available from the author.

Honts, C. R. & Raskin, D. C. (1988). A field study of the validity of the directed-lie control question.
Journal of Police Science and Administration, 16, 56-61.

Honts, C. R., Raskin, D. C. & Kircher, J.C. (1987). Effects of physical countermeasures and their
electromyographic detection during polygraph tests for deception. Psychophysiology, 1,
241-247.

Honts, C. R., Raskin, D. C. & Kircher, J.C. (1994). Mental and physical countermeasures reduce
the accuracy of polygraph tests. Journal of Applied Psychology, 79, 252-259.

Honts, C. R., Raskin, D. C., Kircher, J. C. & Hodes, R.L. (1988). Effects of spontaneous
countermeasures on the physiological detection of deception. Journal of Police Science and
Administration, 16, 91-94.

Honts, C. R. & Reavy, R. (2009). Effects of comparison question type and between test stimulation on
the validity of comparison question test. Final progress report on contract No.W911Nf-07-1-
0670, submitted to the Defense Academy of Credibility Assessment (DACA). Boise State
University.

Honts, C. R., Winbush, M. & Devitt, M.K. (1994). Physical and mental countermeasures can be
used to defeat guilty knowledge tests. Psychophysiology, 31, S57.

Horneman, C. J. & O'Gorman, J.G. (1985). Detectability in the card test as a function of the
subject's verbal response. Psychophysiology, 22, 330-333.

Horowitz, S. W., Kircher, J. C. & Raskin, D.C. (1986). Does stimulation test accuracy predict
accuracy of polygraph tests? Psychophysiology, 23, 442.

Horowitz, S. W., Kircher, J. C., Honts, C. R. & Raskin, D.C. (1997). The role of comparison
questions in physiological detection of deception. Psychophysiology, 34, 108-115.

Horvath, F. & Palmatier, J. (2008). Effect of two types of control questions and two question
formats on the outcomes of polygraph examinations. Journal of Forensic Sciences, 53(4), 1-
11.

Horvath, F. S. (1988). The utility of control questions and the effects of two control question types
in field polygraph techniques. Journal of Police Science and Administration, 16, 198-209.

Horvath, F. S. (1994). The value and effectiveness of the sacrifice relevant question: An empirical
assessment. Polygraph, 23, 261-279.

Iacono, W. G., Boisvenu, G. A. & Fleming, J.A. (1984). Effects of diazepam and methylphenidate on
the electrodermal detection of guilty knowledge. Journal of Applied Psychology, 69, 289-299.

Iacono, W. G., Cerri, A. M., Patrick, C. J. & Fleming, J.A. (1992). Use of antianxiety drugs as
countermeasures in the detection of guilty knowledge. Journal of Applied Psychology,
77, 60-64.

Janig, W. (2006). The Integrative Action of the Autonomic Nervous System: Neurobiology of
Homeostasis. Cambridge University Press.

Kahn, R. L. & Cannell, C. F. (1957). The psychological basis of the interview. In The dynamics of
interviewing: Theory, technique, and cases (pp. 22-64). New York: John Wiley & Sons.

Kahn, J., Nelson, R. & Handler, M. (2009). An exploration of emotion and cognition during
polygraph testing. Polygraph, 38, 184-197.

Kirby, S. L. (1981). The comparison of two stimulus tests and their effect on the polygraph
technique. Polygraph, 10, 63-76.

Kircher, J. C. (1981). Computerized chart evaluation in the detection of deception. Master's thesis:
University of Utah.

Kircher, J. C. (1983). Computerized decision making and patterns of activation in the detection of
deception. Dissertation Abstracts International, 44, 345.

Kircher, J. & Raskin, D. (2002). Computer methods for the psychophysiological detection of
deception. In Murray Kleiner (Ed.), Handbook of Polygraph Testing. San Diego: Academic
Press.

Kircher, J. C., Kristjiansson, S. D., Gardner, M. K. & Webb, A. (2005). Human and computer
decision-making in the psychophysiological detection of deception. University of Utah.

Kircher, J. C. & Raskin, D.C. (1988). Human versus computerized evaluations of polygraph data in
a laboratory setting. Journal of Applied Psychology, 73, 291-302.

Kircher, J. C., Packard, T., Bell, B. G. & Bernhardt, P. C., (2001). Effects of Prior Demonstrations of
Polygraph Accuracy on Outcomes of Probable-Lie and Directed-lie Polygraph Tests. Final
report to the U. S. Department of Defense Polygraph Institute, Ft. Jackson, SC. Salt Lake
City: University of Utah, Department of Educational Psychology.

Krapohl, D. J. (2002). Short report: Update for the objective scoring system. Polygraph, 31, 298-
302.

Krapohl, D. & McManus, B. (1999). An objective method for manually scoring polygraph data.
Polygraph, 28, 209-222.

Krapohl, D. J. & Ryan, A.H. (2001). Final comment on the belated look at symptomatic questions.
Polygraph, 30, 218-219.

Kvale, S. (1996). Interviews: An Introduction to Qualitative Research Interviewing. Sage Publications.

Lehmann, E. L. (1950). Some principles of the theory of testing hypotheses. Annals of Mathematical
Statistics, 21 (1), 1-26.

Lieblich, I., Ben-Shakhar, G., Kugelmass, S. & Cohen, Y. (1978). Decision theory approach to the
problem of polygraph interrogation. Journal of Applied Psychology, 63, 489-498.

Light, G. D. (1999). Numerical evaluation of the Army zone comparison test. Polygraph, 28, 37-45.

Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 140, 5-55.

Lindlof, T. R. & Taylor, B.C. (2002). Qualitative communication research methods. Thousand Oaks:
Sage.

Lykken, D. T. (1981). A tremor in the blood: Uses and abuses of the lie detector. New York:
McGraw-Hill.

MacLaren, V. & Krapohl, D. (2003). Objective assessment of comparison question polygraphy.
Polygraph, 32, 107-126.

Marcum, J. I. (1947). A statistical theory of target detection by pulsed radar. Rand Corporation.
[Retrieved from http://www.rand.org on 11-2-2011].

Maton, A., Hopkins, J., McLaughlin, C., Johnson, S., Quon Warner, M., LaHart, D. & Wright, J.
(1993). Human biology and health. Englewood Cliffs, New Jersey, USA: Prentice Hall.

National Research Council (2003). The polygraph and lie detection. Washington, D.C.: National
Academy of Sciences.

Nelson, R. (2014a). What does the polygraph measure? APA Magazine, 47(2) 39-47.

Nelson, R. (2014b). Redux: What does the polygraph measure? (in 600 words or less). APA
Magazine, 47 (3), 36-37.

Nelson, R. (2014c). Take 3: What does the polygraph measure? (in 250 words or less). APA
Magazine, 47 (4), 60.

Nelson, R. (2014d). Short Answer: What does the polygraph measure? (in 150 words or less). APA
Magazine, 47 (5), 27.

Nelson, R. (2014e). Sound-bite: What does the polygraph measure? (in 50 words or less). APA
Magazine, 47 (6), 29.

Nelson, R. & Handler, M. (2010). Empirical scoring system: NPC quick reference. Lafayette
Instrument Company. Lafayette, IN.

Nelson, R. & Handler, M. (2012). Monte Carlo study of criterion validity of the directed lie
screening test using the Empirical Scoring System and the Objective Scoring System
version 3. Polygraph, 41, 145-155.

Nelson, R. & Handler, M. (2013). A brief history of scientific reviews of polygraph accuracy
research. APA Magazine, 47 (6), 22-28.

Nelson, R. & Handler, M. (2015). Statistical reference distributions for comparison question
polygraphs. Polygraph, 44 (1), 91-114.

Nelson, R., Handler, M., Blalock, B. & Hernandez, N. (2012). Replication and extension study of
Directed Lie Screening Tests: criterion validity with the seven and three Position models and
the Empirical Scoring System. Polygraph 41 (3), 186-198.

Nelson, R., Handler, M., & Morgan, C. (2012). Criterion validity of the directed lie screening test
and the empirical scoring system with inexperienced examiners and non-naive examinees
in a laboratory setting. Polygraph 41 (3), 176-185.

Nelson, R., Handler, M., Shaw, P., Gougler, M., Blalock, B., Russell, C., Cushman, B. & Oelrich, M.
(2011). Using the Empirical Scoring System. Polygraph, 40, 67-78.

Nelson, R., Krapohl, D. & Handler, M. (2008). Brute force comparison: A Monte Carlo study of the
Objective Scoring System version 3 (OSS-3) and human polygraph scorers. Polygraph, 37,
185-215.

Offe, H. & Offe, S. (2007). The comparison question test: Does it work and if so how? Law and
Human Behavior, 31, 291-303.

Office of Technology Assessment (1983). The validity of polygraph testing: A research review and
evaluation. Printed in Polygraph, 12, 198-319.

Ogilvie, J. & Dutton, D. (2008). Improving the detection of physical countermeasures with chair
sensors. Polygraph, 37 (2), 136-148.

Olsen, D. E., Ansley, N., Feldberg, I. E., Harris, J. C. & Cristion, J.A. (1991). Recent developments
in polygraph technology. Johns Hopkins APL Technical Digest, 12, 347-357.

Olsen, D. E., Harris, J. C. & Chiu, W.W. (1994). The development of a physiological detection of
deception scoring algorithm. Psychophysiology, 31, S11.

Olsen, D. E., Harris, J. C., Capps, M. H. & Ansley, N. (1997). Computerized polygraph scoring
system. Journal of Forensic Sciences, 42, 61-70.

Palmatier, J. J. (1991). Analysis of two variations of control question polygraph testing utilizing
exclusive and nonexclusive controls. Unpublished doctoral dissertation.

Palmatier, J. & Rovner, L. (2014). Credibility assessment: Preliminary process theory, the polygraph
process, and construct validity. International Journal of Psychophysiology. [Available online:
http://www.ncbi.nlm.nih.gov/pubmed/24933412].

Paradiso, M. A., Bear, M. F. & Connors, B.W. (2007). Neuroscience: Exploring the Brain.
Hagerstwon, MD: Lippincott Williams & Wilkins.

Patrick, C. J. & Iacono, W.G. (1989). Psychopathy, threat and polygraph test accuracy. Journal of
Applied Psychology, 74, 347-355.

Pavlov, I. P. (1927). Conditioned reflexes: An investigation of the physiological activity of the cerebral
cortex. Sage Publications Inc. 2009.

Podlesny, J. A. & Raskin, D. C. (1978). Effectiveness of techniques and physiological measures in
the detection of deception. Psychophysiology, 15, 344-359.

Podlesny, J., Raskin, D. & Barland, G. (1976). Effectiveness of techniques and physiological
measures in the detection of deception. Report No. 76-5, Contract 75-N1-99-001 LEAA
(available through Department of Psychology, University of Utah, Salt Lake City).

Popper, K. R. (1959). The logic of scientific discovery. Routledge.

Porges, S. (2011). The Polyvagal Theory: Neurophysiological Foundations of Emotions, Attachment,
Communication, and Self-regulation. Norton.

Pratt, J., Raiffa, H. & Schlaifer, R. (1995). Introduction to statistical decision theory. Cumberland, RI:
MIT Press.

Pollina, D. A., Dollins, A. B., Senter, S. M., Krapohl, D. J. & Ryan, A. H. (2004). Comparison of
polygraph data obtained from individuals involved in mock crimes and actual criminal
investigations. Journal of Applied Psychology, 89, 1099-1105.

Powell, M. B. & Snow, P.C. (2007). Guide to questioning children during the free-narrative phase of
an investigative interview. Australian Psychologist, 42 (1), 57-65.

Raffle, A. E. & Muir Gray, J.A. (2007). Screening. Oxford University Press.

Raskin, D. C. & Honts, C. R. (2002). The comparison question test. In M. Kleiner (Ed.), Handbook of
Polygraph Testing. San Diego: Academic Press.

Raskin, D. C., Honts, C. R., Kircher, J. C. (2014). Credibility assessment: Scientific research and
applications. Academic Press.

Raskin, D. C. & Hare, R.D. (1978). Psychopathy and detection of deception in a prison population.
Psychophysiology, 15, 126-136.

Raskin, D. C. & Kircher, J.C. (1990, May 2). Development of a computerized polygraph system and
physiological measures for detection of deception and countermeasures: A pilot study.
Scientific Assessment Technologies, Inc.

Raskin, D. C. & Podlesny, J.A. (1979). Truth and deception: A reply to Lykken. Psychological
Bulletin, 86, 54-59.

Raskin, D., Kircher, J. C., Honts, C. R. & Horowitz, S.W. (1988). A study of the validity of polygraph
examinations in criminal investigations. Final Report, National Institute of Justice, Grant
No. 85-IJ-CX-0040.

Reid, J. E. (1947). A revised questioning technique in lie detection tests. Journal of Criminal Law
and Criminology, 37, 542-547. Reprinted in Polygraph 11, 17-21.

Research Division Staff (1995a). Psychophysiological detection of deception accuracy rates obtained
using the test for espionage and sabotage. DTIC AD Number A330774. Department of
Defense Polygraph Institute. Fort Jackson, SC. Reprinted in Polygraph, 27, (3), 171-180.

Research Division Staff (1995b). A comparison of psychophysiological detection of deception
accuracy rates obtained using the counterintelligence scope polygraph and the test for
espionage and sabotage question formats. DTIC AD Number A319333. Department of
Defense Polygraph Institute. Fort Jackson, SC. Reprinted in Polygraph, 26 (2), 79-106.

Reicherter, J. & Handler, M. (2014). A physiology manual for PDD lifelong learners of the science.
Available from the authors. Retrieved online at
[http://www.polygraph.org/files/apa_psychophysiology_study_guide.pdf] on 11-30-2014.

Rovner, L. I. (1979). The effects of information and practice on the accuracy of physiological detection
of deception. Unpublished doctoral dissertation: University of Utah.

Rovner, L. I. (1986). Accuracy of physiological detection of deception for subjects with prior
knowledge. Polygraph, 15 (1), 1-39.

Rovner, L. I., Raskin, D. C. & Kircher, J.C. (1979). Effects of information and practice on detection
of deception. Psychophysiology, 16, 197-198 (abstract).

Saxe, L. (1991). Science and the CQT polygraph: A theoretical critique. Integrative Physiological and
Behavioral Science, 26, 223-231.

Standring, S. (2005). Gray's anatomy (39th ed.). Elsevier Churchill Livingstone.

Schonhoff, T. A. & Giordano, A.A. (2006). Detection and estimation theory and its applications.
New Jersey: Pearson Education.

Senter, S. M. (2003). Modified general question test decision rule exploration. Polygraph, 32, 251-
263.

Senter, S. M. & Dollins, A.B. (2002). New decision rule development: Exploration of a two-stage
approach. Report number DoDPI00-R-0001. Department of Defense Polygraph Institute
Research Division, Fort Jackson, SC.

Senter, S. M. & Dollins, A.B. (2008). Optimal decision rules for evaluating psychophysiological
detection of deception data: an exploration. Polygraph, 37 (2), 112-124.

Senter, S., Weatherman, D., Krapohl, D. & Horvath, F. (2010). Psychological set or differential
salience: A proposal for reconciling theory and terminology in polygraph testing. Polygraph,
39 (20), 109-117.

Silverthorn, D. U. (2009). Human physiology: An integrated approach (4 ed.). Pearson/Benjamin


Cummings.

Stephenson, M. & Barry, G. (1986). Use of a motion chair in the detection of physical
countermeasures. [Reprinted in Polygraph, 17, 21-27].

Sterling, P. & Eyer, J. (1988). Allostasis: A new paradigm to explain arousal pathology. In
Fisher, S. & Reason, J. (Eds.), Handbook of Life Stress, Cognition and Health. New York:
Wiley and Sons.

Summers, W. G. (1939). Science can get the confession. Fordham Law Review, 8, 334-354.

Swets, J. A. (1964). Signal detection and recognition by human observers. New York: Wiley.

Swets, J. A. (1996). Signal detection theory and ROC analysis in psychology and diagnosis.
Mahwah, NJ: Erlbaum.

Tanner, W. P. & Swets, J. A. (1954). A decision-making theory of visual detection.
Psychological Review, 61 (6), 401-409.

Taylor, S. E., Klein, L. C., Lewis, B. P., Gruenewald, T. L., Gurung, R. A. & Updegraff, J.A. (2000).
Biobehavioral responses to stress in females: Tend-and-befriend, not fight-or-flight.
Psychological Review, 107, 411-429.

Timm, H. W. (1982). Analyzing deception from respiration patterns. Journal of Police Science and
Administration, 10, 47-51.

Timm, H. W. (1991). Effect of posthypnotic suggestions on the accuracy of preemployment
polygraph testing. Journal of Forensic Sciences, 36, 1521-1535.

Vrij, A. (2008). Detecting Lies and Deceit: Pitfalls and Opportunities, 2nd edition. West Sussex,
England: Wiley & Sons.

Wald, A. (1939). Contributions to the theory of statistical estimation and testing hypotheses.
Annals of Mathematical Statistics, 10 (4), 299-326.

Waid, W. M., Orne, E. C. & Orne, M.T. (1981). Selective memory for social information, alertness,
and physiological arousal in the detection of deception. Journal of Applied Psychology, 66,
224-232.

Wells, G. L. & Olson, E.A. (2002). Eyewitness identification: Information gain from incriminating
and exonerating behaviors. Journal of Experimental Psychology: Applied, 8, 155-167.

Wickens, T. D. (1991). Maximum-likelihood estimation of a multivariate Gaussian rating model
with excluded data. Journal of Mathematical Psychology, 36, 213-234.

Wickens, T. D. (2002). Elementary signal detection theory. New York: Oxford.

Widup, R. & Barland, G.H. (1994). Effect of the location of the numbers test on examiner decision
rates in criminal psychophysiological detection of deception tests. Department of Defense
Polygraph Institute.

Wilson, J. & Jungner, G. (1968). Principles and practice of screening for disease. Geneva: World
Health Organization. WHO Chronicle, 22 (11), 473.


Attachment C: Steyn - polygraph report April 2019
Lafayette Instrument Company
Empirical Scoring System - Multinomial
(Nelson, 2017)

Examinee 19N0301Steyn
Result: NO SIGNIFICANT REACTIONS INDICATIVE OF DECEPTION
Posterior odds: 3.9 to 1 odds of truth-telling (0.79)
Lower limit: 2.5 (0.72, one-tailed alpha = 0.05)
Prior odds (deception): 1 (0.5)
Prior odds (truth-telling): 1 (0.5)
Bayes factor: 3.9
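The figures above are related by Bayes' rule in odds form: posterior odds equal the prior odds multiplied by the Bayes factor, and odds of x to 1 correspond to a probability of x/(1+x). The following is a minimal arithmetic sketch of that relationship only; it is illustrative, not the examiner's scoring software, and the function names are this sketch's own:

```python
def posterior_odds(prior_odds: float, bayes_factor: float) -> float:
    # Bayes' rule in odds form: posterior odds = prior odds * Bayes factor.
    return prior_odds * bayes_factor

def odds_to_probability(odds: float) -> float:
    # Odds of x to 1 correspond to probability x / (1 + x).
    return odds / (1.0 + odds)

# Values reported above: prior odds 1 to 1, Bayes factor 3.9.
post = posterior_odds(1.0, 3.9)   # 3.9 to 1
prob = odds_to_probability(post)  # ~0.796 (displayed above as 0.79)
```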

Technique: Event Specific (single issue)

Decision Rule: Two-stage rules

Questions
R5 In your book Lost Boys of Bird Island did you fabricate any of your reported information sources? (No)
R6 Did you falsify any of the allegation you wrote about those persons in your book Lost Boys of Bird Island? (No)
R8 Regarding your book Lost Boys of Bird Island did you falsify any of the reported allegations about those persons? (No)
R9 In your book Lost Boys of Bird Island did you include any of those allegations without an actual human source? (No)
Test Details
  PF Name:
  Exam date: 4/1/2019
  Examiner: Raymond Nelson
  Report date: 4/6/2019

Analysis Parameters
  Prior probability: .5
  Cut ratio: 1
  CI alpha SR/DI (1-tailed): .05
  CI alpha NSR/NDI (1-tailed): .05
  Pairwise alpha (1-tailed): .01
  Alpha - test of proportions: .05

Cutscores
  Total NSR/NDI: 3
  Total SR/DI: -3
  Subtotal SR/DI: -9

Question Results
  R5: NDI/NSR
  R6: NDI/NSR
  R8: NDI/NSR
  R9: NDI/NSR
Summary of Analysis ESS Scores
Recorded physiological data were evaluated with the Empirical Scoring System (ESS). The ESS is Chart 1
an evidence-based, standardized protocol for polygraph test data analysis using a Bayesian R5 R6 R8 R9
classifier with a multinomial reference distribution. Bayesian analysis treats the parameter of T 0 0 1 0
interest (i.e., deception or truth-telling) as a probability value for which the test/experimental data, A 0 0 1 0
together with the prior probability, are a basis of information to calculate a posterior probability. E -2 -2 -2 -2
The multinomial reference distribution is calculated from the analytic theory of the polygraph test - C -1 1 1 0
that greater changes in physiological activity are loaded at different types of test stimuli as a V 1 1 0 0
function of deception or truth-telling in response to relevant target stimuli. The reference Chart 2
distribution for this exam describes the probabilities associated with the numerical scores for all R5 R6 R8 R9
possible combinations of all possible test scores for 3 to 5 presentations of 4 relevant questions using an array of 4 recording sensors: respiration, electrodermal, cardiovascular, and vasomotor. These results were calculated using a prior probability of 0.5, for which the prior odds of truth-telling were 1 to 1. A credible interval (Bayesian confidence interval) was also calculated for the posterior odds of truth-telling using the Clopper-Pearson method and a one-tailed alpha = 0.05. The credible interval describes the variability of the analytic result by treating the test statistic (posterior odds) as a random variable for which the limits of the credible interval can be inferred statistically from the test data. A test result is statistically significant when the lower limit of the credible interval for the posterior odds exceeds the greater of the prior odds or the required minimum cut-ratio.

The categorical test result was parsed from the probabilistic result using two-stage decision rules. Two-stage rules are based on the assumption that the criterion variance of the test questions is non-independent, and they make use of both the grand total and subtotal scores to achieve a categorical classification of the probabilistic test result. The grand total score of 8 equaled or exceeded the required numerical cutscore (3). These data produced a Bayes factor of 3.9. The posterior odds of truth-telling were 3.9 to 1, for which the posterior probability was 0.79. The lower limit of the 1-alpha Bayesian credible interval was 2.5 to 1, which exceeded the prior odds (1 to 1). This indicates a 95% likelihood that the posterior odds of truth-telling exceed the prior odds. These analytic results support the conclusion that there were NO SIGNIFICANT REACTIONS INDICATIVE OF DECEPTION in the recorded changes in physiological activity in response to the relevant test stimuli during this examination.

(T = thoracic respiration, A = abdominal respiration, E = electrodermal, C = cardio, V = vasomotor)

T    NA   NA    0   -1
A    NA   NA    0   -1
E    -2    2   -2   -2
C     1    1   -1    1
V     0   -1    0    1

Chart 3
     R5   R6   R8   R9
T    -1   -1   NA   NA
A    -1   -1   NA   NA
E     2    0   -2   -2
C     1    1   -1   -1
V     1    1   NA   NA

Chart 4
     R5   R6   R8   R9
T     0    0    0    1
A     0    1    1    0
E     2    2   -2    0
C     1    1    1    1
V     0   -1    0    1

Chart 5
     R5   R6   R8   R9
T     0    0    0    0
A     0    0    0    0
E     2   -2    2   -2
C     1    1    1    1
V     1    1    1    1

            R5   R6   R8   R9
Subtotals    7    6   -2   -3
Grand total: 8
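As an illustrative sketch only (not the examiner's scoring software), the arithmetic described above can be reproduced in a few lines of Python. The grand-total cutscore (3) comes from the text; the subtotal cutscore and function names are assumptions for illustration. The posterior probability computed from the rounded Bayes factor (3.9 / 4.9 ≈ 0.80) may differ slightly from the report's 0.79, which is presumably computed from unrounded values.

```python
def posterior_from_bayes_factor(bayes_factor, prior_odds=1.0):
    """Posterior odds = Bayes factor * prior odds; probability = odds / (1 + odds)."""
    posterior_odds = bayes_factor * prior_odds
    posterior_probability = posterior_odds / (1.0 + posterior_odds)
    return posterior_odds, posterior_probability

def two_stage_result(grand_total, subtotals, grand_cut=3, subtotal_cut=-3):
    """Two-stage decision rules, sketched: classify on the grand total first,
    then consult the lowest subtotal. The subtotal cutscore here is an
    illustrative placeholder, not a published value."""
    if grand_total >= grand_cut:
        return "NSR"  # no significant reactions indicative of deception
    if grand_total <= -grand_cut or min(subtotals) <= subtotal_cut:
        return "SR"   # significant reactions
    return "INC"      # inconclusive / no opinion

# Values reported above: Bayes factor 3.9, prior odds 1 to 1,
# grand total 8, subtotals (R5, R6, R8, R9) = 7, 6, -2, -3.
odds, p = posterior_from_bayes_factor(3.9)
result = two_stage_result(8, [7, 6, -2, -3])
```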

References
American Polygraph Association (2011). Meta-analytic survey of criterion accuracy of validated polygraph techniques. Polygraph, 40(4), 196-305. [Electronic version] Retrieved August 20, 2012, from http://www.polygraph.org/section/research-standards-apa-publications.

Bell, B. G., Raskin, D. C., Honts, C. R. & Kircher, J.C. (1999). The Utah numerical scoring system. Polygraph, 28(1), 1-9.

Department of Defense (2006). Federal Psychophysiological Detection of Deception Examiner Handbook. Reprinted in Polygraph, 40(1), 2-66.

Nelson, R. (2017). Multinomial reference distributions for the Empirical Scoring System. Polygraph and Forensic Credibility Assessment. 46(2), 81-115.

Nelson, R. (2017). Updated numerical distributions for the Empirical Scoring System: an accuracy demonstration with archival datasets with and without the vasomotor sensor. Polygraph and Forensic Credibility Assessment, 46(2), 116-131.

Nelson, R. & Handler, M. (2010). Empirical Scoring System. Lafayette Instrument Company.

Nelson, R. & Handler, M. (2012). Using Normative Reference Data with Diagnostic Exams and the Empirical Scoring System. APA Magazine, 45(3), 61-69.

Nelson, R., Handler, M., Shaw, P., Gougler, M., Blalock, B., Russell, C., Cushman, B. & Oelrich, M. (2011). Using the Empirical Scoring System. Polygraph, 40, 67-78.

(11/07/2017 rn)
4/6/2019 19N0401Steyn
Attachment D: Steyn - polygraph report April 2019
Lafayette Instrument Company
Objective Scoring System - Version 3
By Raymond Nelson, Mark Handler and Donald Krapohl (2007)

Result               No Significant Reactions
Description          p-value: 0.033 - Probability this result was produced by a deceptive person
Exam Type            Event Specific/Single Issue (Zone)
Scoring Method       OSS-3 Two-stage (Senter 2003)
Test of Proportions  0.091 - No significant differences in artifact distribution
PF Name              19N0401Steyn
Report Date          Saturday, April 06, 2019
Subject              Chris Steyn
Examiner             System Administrator
   
Spot Scores
Cumulative normal distribution (Barland 1985)
ID   p-value   Result
R9   0.275
R5   0.010
R6   0.003
R8   0.049

Decision Alpha (1 tailed)
Setting                          Value
NSR                              0.050
SR                               0.050
Bonferroni corrected alpha       0.013
Test of Proportions (1 tailed)   0.050

Components
Component   Weight
Pneumo      0.19
EDA         0.53
Cardio      0.28
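The Bonferroni-corrected alpha shown in the decision settings above is the one-tailed decision alpha divided across the four relevant questions; a minimal sketch (variable names are illustrative, and the displayed 0.013 reflects rounding of 0.0125):

```python
alpha = 0.05                                   # one-tailed decision alpha from the settings above
relevant_questions = ["R9", "R5", "R6", "R8"]  # the four RQs on this exam

# Bonferroni correction: divide alpha by the number of spot comparisons
corrected_alpha = alpha / len(relevant_questions)  # 0.0125, displayed as 0.013
```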
 
Relevant Questions
ID Question Text Answer
R9 In your book Lost Boys of Bird Island did you fabricate any of your reported information sources? No
R5 Did you falsify any of the allegations you wrote about those persons in your book Lost Boys of Bird Island? No
R6 Regarding your book Lost Boys of Bird Island did you falsify any of the reported allegations about those persons? No
R8 In your book Lost Boys of Bird Island did you include any of those allegations without an actual human source? No
 
Charts Scored
Exam Chart Date Time
2 2 4/1/2019 1:23 PM
2 3 4/1/2019 1:32 PM
2 4 4/1/2019 1:40 PM
2 5 4/1/2019 1:49 PM
2 6 4/1/2019 1:57 PM
 
Remarks

 
Measurements and Standardized Lognormal Ratios
(Kircher and Raskin 1988)
(A prefix: measurement flagged as artifacted)

Exam 2 Chart 2 - Measurements
          C7    R9   C10    R5    C4    R6    R8
P1      A443   393   382   432   337   399   491
P2      A274   241   274   293   313   295   336
EDA     A201   243    65   158   110   251   117
Cardio  A161    79    88    92   117    67   A41
SE      A385   163   173   161   153   143   154

Exam 2 Chart 2 - Standardized Lognormal Ratios
           R9      R5      R6      R8
P        0.00    2.08    1.33    3.00
EDA     -2.01   -1.15   -2.07   -0.54
Cardio   1.28    0.47    2.15
WMean   -0.70   -0.07   -0.24    0.40
Mean    -0.15

Exam 2 Chart 3 - Measurements
          C7    R9   C10    R5    C4    R6    R8
P1      A281   227   377   424  A366   366  A649
P2      A213   123   276   292   289   233  A392
EDA      116   222   103   228  A153    75  A248
Cardio  A110    39  A103    97   A96    51  A140
SE       158   122   179   127   140   144  A157

Exam 2 Chart 3 - Standardized Lognormal Ratios
           R9      R5      R6      R8
P       -3.00    1.46    0.00
EDA     -1.38   -1.43    0.80
Cardio
WMean   -1.81   -0.66    0.58
Mean    -0.63

Exam 2 Chart 4 - Measurements
          C7    R9   C10    R5    C4    R6    R8
P1       327   292   523   341  A332   354  A478
P2       239   267   362   228  A327   296  A379
EDA       89   151   111    67  A290   105   226
Cardio   A64    64    89    29  A202    47    67
SE       122   154   143   145  A414   123   133

Exam 2 Chart 4 - Standardized Lognormal Ratios
           R9      R5      R6      R8
P       -3.00   -2.22    0.00
EDA     -0.79    0.84   -0.06   -1.60
Cardio   1.64    3.00    3.00    1.40
WMean   -0.53    0.86    0.81   -0.56
Mean     0.14

Exam 2 Chart 5 - Measurements
          C7    R9   C10    R5    C4    R6    R8
P1       252   334   288   345   353   356   366
P2       159   229   181   204   195   219   217
EDA       96    97    41   115   141    71   119
Cardio    47    27  A118    58    58    30    14
SE       126   124   130   124   109   122   132

Exam 2 Chart 5 - Standardized Lognormal Ratios
           R9      R5      R6      R8
P        2.70    1.74    2.28    2.29
EDA     -0.05   -0.39    0.57   -0.46
Cardio   3.00   -0.62    2.85    3.00
WMean    1.33   -0.05    1.54    1.04
Mean     0.96

Exam 2 Chart 6 - Measurements
          C7    R9   C10    R5    C4    R6    R8
P1       372   400   355   336   355  A604   335
P2       190   209   198   189   176  A411   159
EDA       66   237   101    24   120  A189    90
Cardio    85    25    98    14   100   A27   A65
SE       110   117   118    91   104  A169   129

Exam 2 Chart 6 - Standardized Lognormal Ratios
           R9      R5      R6      R8
P        1.35    0.00           -1.20
EDA     -1.78    2.81            0.16
Cardio   3.00    3.00
WMean    0.16    2.32           -0.20
Mean     0.76
 
Channel Contributions
Component   Proportion   Area
Pneumo      0.250        0.832
EDA         0.397        0.226
Cardio      0.353        0.565

Chart             Proportion
Exam 2 Chart 2    0.162
Exam 2 Chart 3    0.212
Exam 2 Chart 4    0.161
Exam 2 Chart 5    0.240
Exam 2 Chart 6    0.224

ID   Proportion
R9   0.159
R5   0.288
R6   0.315
R8   0.238

Results
Weighted Mean (R9, R5, R6, R8):  -0.31   0.48   0.67   0.17
Grand Total Mean:  0.25
 

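As a consistency check (a sketch, not the OSS-3 software itself), the WMean values in the ratio tables above appear reproducible from the component weights (Pneumo 0.19, EDA 0.53, Cardio 0.28), re-normalized over whichever sensors have a usable ratio on a given question. The re-normalization rule is an inference from this output, not a documented specification.

```python
# Component weights from the table above (Pneumo, EDA, Cardio)
WEIGHTS = {"P": 0.19, "EDA": 0.53, "Cardio": 0.28}

def weighted_mean(ratios):
    """Combine per-sensor standardized lognormal ratios for one question.
    When a sensor has no usable ratio (blank cell above), its weight is
    dropped and the remaining weights are re-normalized -- an inference
    from this output, not a documented OSS-3 rule."""
    total_weight = sum(WEIGHTS[sensor] for sensor in ratios)
    return sum(WEIGHTS[sensor] * value for sensor, value in ratios.items()) / total_weight

# Chart 2, R9: P 0.00, EDA -2.01, Cardio 1.28 -> WMean -0.70 in the table
chart2_r9 = weighted_mean({"P": 0.00, "EDA": -2.01, "Cardio": 1.28})

# Chart 3, R9 (no usable cardio ratio): P -3.00, EDA -1.38 -> WMean -1.81
chart3_r9 = weighted_mean({"P": -3.00, "EDA": -1.38})
```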
Advanced Options - OSS-3 v1.9
General Scoring Settings
Delete all zero measurements Yes
Zero Threshold value 1
Allow a single CQ to score result (not for DLST) Yes
Replace missing values with mean values No
Check Extreme Contributions No
Allow SR Result when extreme contributions Yes
Alpha Values (one-tailed)
Kruskal-Wallis 0.1
Non-Significant Response (NSR) 0.05
Significant Response (SR) 0.05
Test of Proportions
Test of Proportions alpha value (two-tailed) 0.1
Use Test of Proportions Yes
Allow significant reaction result Yes
Use all questions No
Score neutral questions as control No
Event Specific/Single Issue (Zone)
Use Bonferroni Yes
Use Kruskal-Wallis No
Minimum number of useable presentations for RQs 2
Measurement Periods
P1 15
P2 15
EDA 15
Cardio 15
All other 15

Attachment E: Steyn - polygraph report April 2019
Lafayette Instrument Company
Objective Scoring System - Version 3
By Raymond Nelson, Mark Handler and Donald Krapohl (2007)

Result               No Significant Reactions
Description          p-value: 0.002 - Probability this result was produced by a deceptive person
Exam Type            Event Specific/Single Issue (Zone)
Scoring Method       OSS-3 Two-stage (Senter 2003)
Test of Proportions  None - No significant differences in artifact distribution
PF Name              19N0401Steyn
Report Date          Saturday, April 06, 2019
Subject              Chris Steyn
Examiner             System Administrator
   
Spot Scores
Cumulative normal distribution (Barland 1985)
ID   p-value   Result
R9   0.024
R5   < 0.001
R6   < 0.001
R8   0.004

Decision Alpha (1 tailed)
Setting                          Value
NSR                              0.050
SR                               0.050
Bonferroni corrected alpha       0.013
Test of Proportions (1 tailed)   0.050

Components
Component   Weight
Pneumo      0.19
EDA         0.53
Cardio      0.28
 
Relevant Questions
ID Question Text Answer
R9 In your book Lost Boys of Bird Island did you fabricate any of your reported information sources? No
R5 Did you falsify any of the allegations you wrote about those persons in your book Lost Boys of Bird Island? No
R6 Regarding your book Lost Boys of Bird Island did you falsify any of the reported implications about those persons? No
R8 In your book Lost Boys of Bird Island did you include any of those allegations without an actual human source? No
 
Charts Scored
Exam Chart Date Time
2 2 4/1/2019 1:23 PM
2 3 4/1/2019 1:32 PM
2 4 4/1/2019 1:40 PM
2 5 4/1/2019 1:49 PM
2 6 4/1/2019 1:57 PM
 
Remarks

 
Measurements and Standardized Lognormal Ratios
(Kircher and Raskin 1988)

Exam 2 Chart 2 - Measurements
          C7    R9   C10    R5    C4    R6    R8
P1       443   393   382   432   337   399   491
P2       274   241   274   293   313   295   336
EDA      201   243    65   158   110   251   117
Cardio   161    79    88    92   117    67    41
SE       385   163   173   161   153   143   154

Exam 2 Chart 2 - Standardized Lognormal Ratios
           R9      R5      R6      R8
P        0.00    1.38    0.64    2.57
EDA     -1.29   -0.43   -1.35    0.18
Cardio   2.19    1.39    3.00    3.00
WMean   -0.07    0.43    0.25    1.43
Mean     0.51

Exam 2 Chart 3 - Measurements
          C7    R9   C10    R5    C4    R6    R8
P1       281   227   377   424   366   366   649
P2       213   123   276   292   289   233   392
EDA      116   222   103   228   153    75   248
Cardio   110    39   103    97    96    51   140
SE       158   122   179   127   140   144   157

Exam 2 Chart 3 - Standardized Lognormal Ratios
           R9      R5      R6      R8
P       -3.00    2.39    0.00    3.00
EDA     -1.13   -1.18    1.05   -1.35
Cardio   3.00    0.22    3.00   -1.71
WMean   -0.33   -0.10    1.39   -0.62
Mean     0.08

Exam 2 Chart 4 - Measurements
          C7    R9   C10    R5    C4    R6    R8
P1       327   292   523   341   332   354   478
P2       239   267   362   228   327   296   379
EDA       89   151   111    67   290   105   226
Cardio    64    64    89    29   202    47    67
SE       122   154   143   145   414   123   133

Exam 2 Chart 4 - Standardized Lognormal Ratios
           R9      R5      R6      R8
P       -2.44   -2.49   -0.64    2.26
EDA      0.20    1.83    0.92   -0.61
Cardio   3.00    3.00    3.00    2.90
WMean    0.48    1.33    1.21    0.92
Mean     0.98

Exam 2 Chart 5 - Measurements
          C7    R9   C10    R5    C4    R6    R8
P1       252   334   288   345   353   356   366
P2       159   229   181   204   195   219   217
EDA       96    97    41   115   141    71   119
Cardio    47    27   118    58    58    30    14
SE       126   124   130   124   109   122   132

Exam 2 Chart 5 - Standardized Lognormal Ratios
           R9      R5      R6      R8
P        2.70    1.74    2.28    2.29
EDA     -0.05   -0.39    0.57   -0.46
Cardio   3.00    1.21    3.00    3.00
WMean    1.33    0.47    1.58    1.04
Mean     1.10

Exam 2 Chart 6 - Measurements
          C7    R9   C10    R5    C4    R6    R8
P1       372   400   355   336   355   604   335
P2       190   209   198   189   176   411   159
EDA       66   237   101    24   120   189    90
Cardio    85    25    98    14   100    27    65
SE       110   117   118    91   104   169   129

Exam 2 Chart 6 - Standardized Lognormal Ratios
           R9      R5      R6      R8
P        1.35    0.00    3.00   -1.20
EDA     -1.78    2.81   -1.33    0.16
Cardio   3.00    3.00    3.00    1.87
WMean    0.16    2.32    0.72    0.38
Mean     0.89
 
Channel Contributions
Component   Proportion   Area
Pneumo      0.237        0.788
EDA         0.326        0.097
Cardio      0.436        0.848

Chart             Proportion
Exam 2 Chart 2    0.186
Exam 2 Chart 3    0.143
Exam 2 Chart 4    0.224
Exam 2 Chart 5    0.231
Exam 2 Chart 6    0.217

ID   Proportion
R9   0.206
R5   0.269
R6   0.281
R8   0.244

Results (Without Visual Inspection)
Weighted Mean (R9, R5, R6, R8):  0.31   0.89   1.03   0.63
Grand Total Mean:  0.71
 

Advanced Options - OSS-3 v1.9
General Scoring Settings
Delete all zero measurements Yes
Zero Threshold value 1
Allow a single CQ to score result (not for DLST) Yes
Replace missing values with mean values No
Check Extreme Contributions No
Allow SR Result when extreme contributions Yes
Alpha Values (one-tailed)
Kruskal-Wallis 0.1
Non-Significant Response (NSR) 0.05
Significant Response (SR) 0.05
Test of Proportions
Test of Proportions alpha value (two-tailed) 0.1
Use Test of Proportions Yes
Allow significant reaction result Yes
Use all questions No
Score neutral questions as control No
Event Specific/Single Issue (Zone)
Use Bonferroni Yes
Use Kruskal-Wallis No
Minimum number of useable presentations for RQs 2
Measurement Periods
P1 15
P2 15
EDA 15
Cardio 15
All other 15

