Assessments with Only Face Validity


I visit LinkedIn often and enjoy reading and participating in some of the discussions in the industrial-organizational psychology groups. The other day, a question came up regarding whether an instrument/assessment/tool with only “face validity” without other types of validity is valuable to clients in a business setting.

People often bring up face validity, but “face validity relates more to what a test appears to measure to the person being tested than to what the test actually measures” (Cohen & Swerdlik, 2009, p. 174). This distinction is very important, as you’ll see.

I am not an expert in psychometrics, but I believe there are some important things to consider (beyond face validity or “face value”) when evaluating an assessment, whether it’s for business, educational, or personal use. These things are: content, criterion, and construct validity; reliability; and norming.

VALIDITY – Is the test measuring what it says it measures?

“Face validity relates more to what a test appears to measure to the person being tested than to what the test actually measures” (Cohen & Swerdlik, 2009, p. 174). A test that seems, on the face of it, to measure what it claims to measure has good “face validity” in the eyes and mind of the testtaker/respondent. In other words, if I believe that a test I’m taking looks legitimate, it will give me confidence in the test and help keep me motivated as I’m taking it. In terms of selling a test, a test/assessment with good face validity helps convince potential buyers (e.g., supervisors, HR staff, executives) to “buy in.”

However, what many do not realize is that a test lacking face validity can still be relevant and useful, although it might be poorly received by testtakers. “Ultimately, face validity may be more a matter of public relations than psychometric soundness, but it seems important nonetheless” (Cohen & Swerdlik, 2009, p. 176).

Content validity means the content of the test looks like the content of the job. “For an employment test to be content-valid, its content must be a representative sample of the job-related skills required for employment” (Cohen & Swerdlik, 2009, p. 176). Content validity has to do with whether or not the test ‘covered all bases.’ For instance, a test of American history that has only questions (items) about the Civil War has inadequate content validity because the questions would not be representative of the entire subject of American history (i.e., the Civil War was a significant but small part of the entire history of the United States) (Vogt & Johnson, 2011). Another example of a test that is not content valid would be a depression inventory that only asks questions about feelings of sadness. Again, this illustrates inadequate content validity because there are other aspects of depression that need to be considered (e.g., energy level, concentration ability, and weight gain/loss) (Barber, Korbanka, Stradleigh, & Nixon, 2003).

Criterion-related validity is the ability of a test to make accurate predictions. For example, the degree to which a student’s SAT score predicts his or her college grades is an indication of the SAT’s criterion-related validity (Vogt & Johnson, 2011).
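To make the idea of prediction concrete, here is a minimal sketch of how the relationship between test scores and a later criterion is often summarized with a correlation coefficient (sometimes called a validity coefficient). The scores and GPAs below are made-up numbers for illustration only, and the function name pearson_r is my own; a real validation study would involve far more people and more careful analysis.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical data (not from any real study): admission test scores
# and the same students' later first-year college GPA.
test_scores = [1050, 1120, 1200, 1280, 1350, 1430]
first_year_gpa = [2.6, 2.9, 2.8, 3.3, 3.4, 3.7]

validity_coefficient = pearson_r(test_scores, first_year_gpa)
print(f"correlation between test score and criterion: {validity_coefficient:.2f}")
```

The closer the coefficient is to 1, the better the test scores track the criterion in this sample; a value near 0 would suggest the test tells us little about later performance.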

Construct validity is the degree to which variables on a test accurately measure the construct. According to Cohen and Swerdlik (2009): “Construct validity is a judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct. A construct is an informed, scientific idea developed or hypothesized to describe or explain behavior” (p. 193). Examples of constructs are intelligence, job satisfaction, self-esteem, anxiety, etc. “The researcher investigating a test’s construct validity must formulate hypotheses about the expected behavior of high scorers and low scorers on the test. These hypotheses give rise to a tentative theory about the nature of the construct the test was designed to measure. If the test is a valid measure of the construct, then high scorers and low scorers will behave as predicted by the theory” (Cohen & Swerdlik, 2009, p. 193).

Dr. Wendell Williams, in his book “Superselection: The art and the science of employee selection and placement” (2003), offered a nice breakdown of the different types of validity as they relate to determining whether test scores are related to job performance (p. 4):

  • Content validation: Does the content of the test resemble the content of the job?
    • Example: A test requiring an applicant to type a letter is considered content valid because passing the test demonstrates an applicant’s typing ability.
  • Criterion validation: Do higher test scores predict higher job performance?
    • Example: The test would satisfy the requirements for criterion-related validity if higher scores on that typing test were associated with better performance on the job.
  • Construct validation: Does the test measure deep-seated mental constructs that are associated with job performance?
    • Dr. Williams said construct validity in employee selection testing is difficult to identify or interpret and warns against relying on or using it. For instance, he offered the construct of attitude. “If you discovered that ‘attitude’ had something to do with keyboarding skills, you could give each applicant an attitude test. However, you would have the burden of proving that attitude (a construct) predicted typing ability” (Williams, 2003, p. 4).


RELIABILITY – Is the instrument consistent, stable, repeatable?

Reliability is measured in a number of different ways including test-retest reliability, split-half reliability, alpha reliability and inter-rater (coder) reliability. Reliability is the ability of a test or assessment to consistently measure the topic or construct under study at different times and across different populations (Hinton, Brownlow, McMurray, & Cozens, 2004). For example, a bathroom scale that gives you a different reading of your weight each time you step on it is not reliable.
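As one illustration, here is a rough sketch of how alpha reliability could be computed, assuming “alpha” refers to Cronbach’s alpha (the usual meaning). It compares the variance of the individual items with the variance of respondents’ total scores. The responses are invented for the example, and the function name cronbach_alpha is my own.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows (one row of item scores per respondent)."""
    k = len(responses[0])  # number of items on the test
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    # Transpose so each tuple holds every respondent's score on one item
    items = list(zip(*responses))
    sum_item_variances = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)

# Hypothetical data: 5 respondents answering 4 items on a 1-5 scale
responses = [
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha: {alpha:.2f}")
```

When the items move together (people who score high on one item tend to score high on the others), alpha approaches 1; when the items are unrelated, it falls toward 0.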


“In a psychometric context, norms are the test performance data of a particular group of testtakers that are designed for use as a reference when evaluating or interpreting individual test scores” (Cohen & Swerdlik, 2009, p. 111).

Here’s a good example from Cohen and Swerdlik (2009) to illustrate this idea of norming and generalizability:

“For example, a test carefully normed on school-age children who reside within the Los Angeles school district may be relevant only to a lesser degree to school-age children who reside within the Dubuque, Iowa, school district. How many children in the standardization sample were English speaking? How many were of Hispanic origin? How does the elementary school curriculum in Los Angeles differ from the curriculum in Dubuque? These are the types of questions that must be raised before the Los Angeles norms are judged to be generalizable to the children of Dubuque” (p. 117).
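A small sketch may help show why the choice of norm group matters. Below, the same raw score is interpreted against two different reference groups. Both groups are made up for illustration (they do not represent any real school district), and the helper names z_score and percentile_rank are my own.

```python
from statistics import mean, stdev

def z_score(raw, norm_scores):
    """How many standard deviations the raw score sits from the norm group mean."""
    return (raw - mean(norm_scores)) / stdev(norm_scores)

def percentile_rank(raw, norm_scores):
    """Percent of the norm group scoring below raw, counting ties as half."""
    below = sum(1 for s in norm_scores if s < raw)
    equal = sum(1 for s in norm_scores if s == raw)
    return 100 * (below + 0.5 * equal) / len(norm_scores)

# Hypothetical norm groups: scores on the same test from two different populations
norm_group_a = [38, 42, 45, 47, 50, 52, 55, 58, 60, 63]
norm_group_b = [50, 54, 57, 59, 62, 64, 66, 69, 72, 77]

raw_score = 58
for name, group in [("A", norm_group_a), ("B", norm_group_b)]:
    print(
        f"Against norm group {name}: "
        f"z = {z_score(raw_score, group):+.2f}, "
        f"percentile rank = {percentile_rank(raw_score, group):.0f}"
    )
```

In this made-up data, a raw score of 58 looks above average against group A and below average against group B, which is why a report based on the wrong norm group can mislead the people reading it.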

When I worked for a school system overseas on a small Pacific Island, the school psychologists would give the students U.S.-developed educational assessments that were not “normed” with Pacific Island students. So what they did was add a small “warning” statement on their reports to let everyone know. Of course, none of the students and almost none of the parents understood what this meant. What’s even more tragic is that I never saw attempts by these psychologists to help train educators on reading and interpreting these assessments. So the majority of the teachers didn’t understand this either.

It is very tempting to take an assessment, receive back a nice-looking report, with beautiful graphs and professional-sounding words (often computer-generated), and think the assessment is “valid” or “good.”

As Wendell Williams, MBA, Ph.D. (an industrial psychologist who develops and validates tests) warns: We need to be extra careful to ensure that a test/assessment meets professional standards, or else we’re basically giving back scores that are worthless junk to people who will completely trust the results.

The bottom line: For legal, ethical, and statistical reasons, make sure you use a test that has good psychometric properties, beyond just a pretty face (i.e., face validity). A test with weak/inadequate content, criterion, and/or construct validity; poor reliability; and inadequate norming is useless.


Barber, A., Korbanka, J., Stradleigh, N., & Nixon, J. (2003). Research and statistics for the social sciences. Boston: Pearson.

Cohen, R. J., & Swerdlik, M. E. (2009). Psychological testing and assessment: An introduction to tests and measurement (7th ed.). New York, NY: McGraw-Hill.

Hinton, P. R., Brownlow, C., McMurray, I., & Cozens, B. (2004). SPSS explained. New York: Routledge.

Vogt, W. P., & Johnson, R. B. (2011). Dictionary of statistics & methodology: A nontechnical guide for the social sciences (4th ed.). Thousand Oaks, CA: Sage.

Williams, R. W. (2002, August 29). Separating Good Sense From Nonsense: Who Can You Believe? Retrieved from

Williams, R. W. (2003). Superselection: The art and the science of employee selection and placement. Acworth, Georgia: ScientificSelection Press.

Williams, R. W. (2005, October 6). Anatomy of a Test Vendor. Retrieved from

Williams, R. W. (2005, November 3). Is This Test Validated for Your Industry? Retrieved from

Williams, R. W. (2007, May 18). Validating a Personality Test. Retrieved from

Williams, R. W. (2007, October 31). Good Test? Bad Test? Retrieved from

Williams, R. W. (2008, December 10). Dissecting the DISC. Retrieved from

Williams, R. W. (2010, February 10). Promises, Promises: How to Identify a Bad Hiring Test (Part I of II). Retrieved from

Williams, R. W. (2010, February 11). Promises, Promises: How to Identify a Bad Hiring Test (Part II of II). Retrieved from

Williams, R. W. (2010, June 24). Uncovering Test Secrets, Part 1. Retrieved from

Williams, R. W. (2010, June 25). Uncovering Test Secrets, Part 2. Retrieved from

Williams, R. W. (2012, February 1). Bad Tests and Fake Bird Seed. Retrieved from