Methods Corner: Let’s talk about reliability and validity

by Adar Ben-Eliyahu, Ph.D., The Chronicles of Evidence-Based Mentorship

In evaluations of mentoring programs, we select questionnaires that we think measure various outcomes. So, for example, if we think that mentoring will improve self-esteem, we might give a questionnaire that asks mentees how good they feel about themselves.

In contrast to something like height or speed, self-esteem can’t be seen. Instead, we hope that our questionnaire “gets at” the construct. So, we can’t help but ask two important questions:

  • Are we really measuring what we intend to measure?
  • How consistent is our measurement?

The first question – are we measuring what we intend to measure – pertains to the validity of our measure. Is the measure actually measuring self-esteem?

The second question relates to the consistency, or reliability, of our measure. If the test is designed to measure self-esteem, then each time the test is used, it should yield roughly the same result.

So, validity is a measure of the accuracy of the test (is this an accurate test of self-esteem?), while reliability is a measure of the precision of the test (is it a consistent measure of self-esteem?).

Validity: Is this what we are measuring?

There are many different ways to evaluate validity or accuracy in measuring our intended concept.

One group of validity measures investigates the extent to which our measure aligns with our theory (i.e., maps onto the variable of interest). In this approach, we are interested in:

  1. Construct validity refers to the extent to which the measurement maps onto the theorized variable (e.g., self-esteem).
  2. Content validity refers to the breadth of the measurement – is it measuring the different aspects of the theorized content or is the measure too narrow?
  3. Face validity refers to whether the measure appears to be measuring the desired content. Sometimes certain survey items may look like they measure what we would like, but they don’t, whereas other items may look like they are irrelevant when in fact they do tap into an aspect of our theory.

A good example of a measure with all three of these validities appears in the recently summarized article about natural mentors for homeless youth, in which social support was measured as instrumental, informational, and emotional support. By measuring these three forms of support, the measure maps onto the theoretical conceptualization of support (Construct Validity). The breadth of the measure is evidence of Content Validity. And Face Validity is supported by looking at the particular questions from the survey.

A second group of validity measures inspects the extent to which our construct is related to, or distinct from, other constructs.

  1. Criterion validity is established when the desired construct correlates with another measure known to indicate it.
  2. Concurrent validity refers to the degree to which the measurement of our construct correlates with other measures of the same construct that are measured at the same time.
  3. Predictive validity refers to the degree to which our construct can predict (or correlate with) other measures of the same construct that are measured at some time in the future.

In the article about self-regulated learning and mentoring, these three forms of validity can be established by inspecting the correlations between the sub-components of self-regulated learning. For example, we would expect strategic planning and organizational strategies to be associated, suggesting concurrent and criterion validity. Predictive validity can be established by looking at these strategies over time, as sketched below.
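
In practice, these checks often reduce to simple correlations. Below is a minimal sketch in Python, using invented data and hypothetical variable names (strategic planning and organizational strategies are borrowed from the example above; the numbers are not from any real study): two measures of the same construct taken at the same time should correlate (concurrent validity), and scores at time 1 should correlate with scores on the same construct at time 2 (predictive validity).

```python
import numpy as np

# Invented scores for 100 mentees (illustrative only, not real data).
rng = np.random.default_rng(0)
strategic_planning_t1 = rng.normal(3.5, 0.8, 100)

# A related measure taken at the same time point...
organizational_strategies_t1 = 0.6 * strategic_planning_t1 + rng.normal(0, 0.6, 100)

# ...and the same construct measured again a year later.
strategic_planning_t2 = 0.7 * strategic_planning_t1 + rng.normal(0, 0.5, 100)

# Concurrent validity: two measures of the construct at the same time correlate.
r_concurrent = np.corrcoef(strategic_planning_t1, organizational_strategies_t1)[0, 1]

# Predictive validity: time-1 scores correlate with time-2 scores.
r_predictive = np.corrcoef(strategic_planning_t1, strategic_planning_t2)[0, 1]

print(f"concurrent r = {r_concurrent:.2f}, predictive r = {r_predictive:.2f}")
```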

Two forms of validity are related to the research design:

  1. External validity is the extent to which the research findings apply to groups of people or situations not studied within the particular study. For example, does mentoring support adolescents with diabetes in the same way it supports adults with diabetes, or adolescents with another type of diagnosis, such as depression?
  2. Internal validity considers whether alternative explanations could account for the observed effects. For example, in studies that examine effects over time (longitudinal studies), ordinary development may also explain the results – therefore we prefer to have a control group in addition to the experimental group.

Reliability: How consistent is our measure?

There are four common types of reliability.

  1. Internal consistency is probably the most common type of reliability reported in social science research. It is often indexed by Cronbach’s alpha. This type of reliability evaluates the extent to which the questions in a questionnaire assess the same intended construct; it is, in essence, a summary of the correlations among questionnaire items. The calculation also takes into account the number of items, so including more questions tends to yield higher alphas. Cronbach’s alpha is standardized, with .6–.8 considered moderate and above .8 considered high reliability (see the first sketch after this list).
  2. Test-retest reliability refers to the consistency of scores when measuring, then re-measuring, the same construct with the same exact questionnaire. This type of reliability is exposed to many measurement issues, including respondents’ familiarity with the questions and memory of how they answered the first time. Even so, we would expect a positive association between the different administrations.
  3. Inter-rater reliability assesses the extent to which two individual raters or researchers agree on statements or observations. This is a good indicator of consistency for qualitative research that uses observational methods, interviews, open-ended responses, and other such qualitative techniques (see the second sketch below).
  4. Inter-method reliability examines the consistency of an evaluation across different research methods. For example, a questionnaire about relationship quality can be used in parallel with an interview, and researchers can examine the extent to which the questionnaire aligns with the interview.
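
To make internal consistency concrete, here is a minimal sketch of Cronbach’s alpha in Python. The item scores are invented for illustration (a hypothetical five-item self-esteem questionnaire); the formula itself – k / (k − 1) times (1 − sum of item variances / variance of the total score) – is standard.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Invented responses: 6 mentees answering a hypothetical 5-item
# self-esteem questionnaire on a 1-5 scale.
scores = np.array([
    [4, 4, 5, 4, 4],
    [2, 3, 2, 2, 3],
    [5, 4, 5, 5, 4],
    [3, 3, 3, 2, 3],
    [4, 5, 4, 4, 5],
    [2, 2, 3, 2, 2],
])

print(f"alpha = {cronbach_alpha(scores):.2f}")  # above .8 would count as high
```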
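
Inter-rater reliability, in turn, is often quantified with Cohen’s kappa, which corrects the raw agreement between two raters for the agreement expected by chance. Here is a minimal sketch, again with invented ratings from two hypothetical coders; test-retest and inter-method reliability can be checked with simple correlations across administrations or methods, much like the validity sketch earlier.

```python
import numpy as np

def cohens_kappa(rater_a, rater_b) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    categories = np.union1d(a, b)
    p_observed = np.mean(a == b)  # raw proportion of agreement
    # Chance agreement: product of each rater's marginal proportions,
    # summed over all rating categories.
    p_chance = sum(np.mean(a == c) * np.mean(b == c) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

# Invented example: two coders rate 10 interview excerpts as
# "emotionally supportive" (1) or not (0).
coder_1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
coder_2 = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

print(f"kappa = {cohens_kappa(coder_1, coder_2):.2f}")
```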

Schwartz, Rhodes, Spencer, & Grossman (2013) provide a good example of a study that used a number of methods to investigate how mentoring relationship duration is related to a variety of outcomes. They note that “Qualitative data from these participants supported quantitative findings, indicating that enduring mentoring relationships could positively influence participant outcomes” (p. 11), which also suggests high inter-method reliability.