Assessing Fairness

Detecting Gender Bias in Assessments

Differential Test Functioning Plot, comparing Gender and Overall Foresight Performance


Executive Summary

  • Assessments – psychometrics, interviews, algorithms, etc. – need to be checked for fairness.
  • Fairness can be assessed using various techniques for both legally protected and unprotected groups.
  • Both Habitus and Foresight show evidence of being fair with respect to Gender. 






  • In the age of the algorithm, there is ample evidence of people not keeping an eye on fairness. A recent example is a recruitment approach by Amazon that showed bias against women.
  • However, the psychology space has long focused on fairness in areas beyond gender, such as age, ethnicity or English as a second language.
  • How fairness / bias is measured is important, with many approaches being a little too simple and not getting to the heart of the matter.
  • This article will look at various projects that interact with fairness, to illustrate the pros and cons around measuring fairness in assessments. 


Approach 1 – Differential Test Functioning (DTF)

  • DTF is a strong method for detecting bias in an assessment.
  • Its key advantage is that it controls for a person’s standing on the latent trait and then looks for variance that can be attributed to group membership.
  • Using less geek speak, if we had a group of really smart / crisp apples and another group of very dull / squishy bananas and measured their smarts – surely we would expect the apples to outperform the bananas. So what would we conclude? That our assessment is biased against bananas?? That may be true but the bigger issue is that we didn’t control for level of smarts before looking for any difference between our apples and bananas.  
  • The plot below illustrates DTF. In this case it plots Foresight performance on the y or vertical axis against an estimate of a candidate’s general mental ability on the x or horizontal axis. There are two lines, one for candidates who identified as male and the other for female*.
  • The plot shows the lines are very close together, which is ideal, as it means there is very little difference between the groups or, said differently, minimal bias.
  • These plots can be very useful, particularly when setting cut-scores or pass-marks for an assessment, as there may be bias in an assessment but NOT where we are setting our pass-mark.

*Some candidates identified as Other, however there were too few to include in specific analyses. Generally we look for about 200 people in a ‘group’ before feeling confident in these types of analyses.
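
The apples and bananas point can be sketched in a few lines of Python. This is a minimal illustration on made-up data, not the analysis behind the plot: the function name `raw_and_adjusted_gap`, the simulated numbers and the use of a plain least-squares regression are all assumptions for the example. It compares the raw difference in group means with the group difference that remains once ability is controlled for.

```python
import numpy as np

def raw_and_adjusted_gap(ability, score, group):
    """Return (raw mean gap, gap after controlling for ability).

    `group` holds 0/1 labels; both gaps are group 1 minus group 0.
    """
    ability = np.asarray(ability, dtype=float)
    score = np.asarray(score, dtype=float)
    group = np.asarray(group, dtype=float)

    # Naive comparison: just the difference in average scores.
    raw_gap = score[group == 1].mean() - score[group == 0].mean()

    # Least-squares fit of: score = b0 + b1 * ability + b2 * group.
    # b2 is the group difference left over once ability is accounted for.
    X = np.column_stack([np.ones_like(ability), ability, group])
    coef, *_ = np.linalg.lstsq(X, score, rcond=None)
    return raw_gap, coef[2]

# Made-up data: apples (group 1) have higher ability than bananas (group 0),
# but both groups are scored by the same rule (score = 2 * ability + 1 + noise).
rng = np.random.default_rng(0)
ability = np.concatenate([rng.normal(-1.0, 1.0, 200), rng.normal(1.0, 1.0, 200)])
group = np.concatenate([np.zeros(200), np.ones(200)])
score = 2.0 * ability + 1.0 + rng.normal(0.0, 0.5, 400)

raw_gap, adjusted_gap = raw_and_adjusted_gap(ability, score, group)
print(f"raw gap: {raw_gap:.2f}, gap after controlling for ability: {adjusted_gap:.2f}")
```

In this simulated case the raw gap is large while the adjusted gap sits close to zero, which is the situation where an assessment looks biased on the surface but is treating both groups the same way.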



DTF and Habitus

  • DTF can also be applied to a section within an assessment, for example a specific scale within a personality questionnaire.
  • DTF produces two key outputs:
    1. An estimate of the intercept bias – this is where the line for each group would cross the y or vertical axis.
    2. An estimate of the slope bias – this is how quickly or sharply the line travels from the bottom-left to the top-right of the graph.
  • Results for Habitus are shown below; the bias estimates are consistently very low.
  • The ‘worst’ performing scale with the largest bias is the Intensity scale, where people identifying with Female gender scored slightly higher than those identifying with Male (shown in the graph below).
  • The Intensity scale looks at preferences around having a stronger emotional response to events, being more passionate and expressive with respect to emotions at work.
  • Putting aside the previous research that has found similar results, let’s put this result in context. It means that when we control for Intensity, those people identifying as Female will score around 4.5% of the scale higher than Males. Each Habitus scale is out of 20, so rounding our difference up to 5% makes this 1 point. This means that if we take a person who identifies as Female and one who identifies as Male, who really share the same level of Intensity (let’s say 10 out of 20), the Female will likely have a score of 11 on this scale (1 point, or 5% of the 20-point scale, more than the person identifying as Male). This is not a large difference in either statistical or practical terms.
DTF plot of the Intensity Scale within Habitus and Gender
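
The two outputs described above, intercept bias and slope bias, can be sketched by fitting a straight line for each group and comparing them. This is a simplified, hedged illustration rather than the method used for Habitus: formal DTF analyses are typically model-based, and the function name `intercept_and_slope_bias` and the tiny data set are made up for the example.

```python
import numpy as np

def intercept_and_slope_bias(trait, score, group, focal, reference):
    """Fit one straight line per group and return (intercept bias, slope bias).

    Both values are the focal group's estimate minus the reference group's.
    """
    trait = np.asarray(trait, dtype=float)
    score = np.asarray(score, dtype=float)
    group = np.asarray(group)

    fits = {}
    for label in (focal, reference):
        mask = group == label
        # np.polyfit returns coefficients from the highest power down: slope, then intercept.
        slope, intercept = np.polyfit(trait[mask], score[mask], deg=1)
        fits[label] = (intercept, slope)

    intercept_bias = fits[focal][0] - fits[reference][0]
    slope_bias = fits[focal][1] - fits[reference][1]
    return intercept_bias, slope_bias

# Made-up data on a 20-point scale: at every trait level the Female score is
# exactly 1 point above the Male score, so the two lines are parallel.
trait_levels = [4, 8, 12, 16]
trait = trait_levels + trait_levels
group = ["Male"] * 4 + ["Female"] * 4
score = [t for t in trait_levels] + [t + 1 for t in trait_levels]

intercept_bias, slope_bias = intercept_and_slope_bias(
    trait, score, group, focal="Female", reference="Male"
)
percent_of_scale = 100 * intercept_bias / 20  # express the gap as a share of the 20-point scale
print(f"intercept bias: {intercept_bias:.2f} points ({percent_of_scale:.1f}% of the scale)")
print(f"slope bias: {slope_bias:.2f}")
```

A non-zero intercept bias with a slope bias near zero means one group scores a constant amount higher across the whole range. A non-zero slope bias would mean the gap changes depending on where a person sits on the trait.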

The Bigger Picture

  • However, a scale in a questionnaire or a stage in an assessment process needs to be evaluated as part of a wider context.
  • In relation to Habitus, this is often combined with Foresight in the Job Fit Assessment. This allows the scales within both assessments to be mapped to tens-of-thousands of roles from the O*net library.
  • A person’s fit to a role(s) can then be assessed ‘in-the-round’ and explored at a more specific level as desired.
  • Some example images from feedback reports are shown to the right. The roles were created from cluster analysis of the original O*net library to support a client’s graduate recruitment campaign.
  • The ‘Overall Percentile Fit’ is based on how well a person matches the role requirements, compared to the Habitus Professional comparison group.
  • This allows recruiters and hiring managers to look at fit to the role in the round and then pay closer attention to areas indicating potentially ‘too much’ or ‘too little’ of a certain work style, for a specific role(s).

Approach 2 – Mean Differences and the 4/5ths Rule

  • The swiftest method for exploring group differences on an assessment is to look at the mean differences. So we calculate the mean average performance for each group and see if these meet one or more rules-of-thumb, for example:
    • Statistical significance – depending on the number of groups involved people often use t-tests or ANOVAs. This approach clearly fails to take into account where the various group members were on the area being measured. Plus, the more people assessed, the greater the chance of finding a result that is statistically but NOT practically significant.
    • Effect Size – This is a step forward from significance testing as it looks at the size of the differences between groups. A researcher called Cohen came up with three effect sizes or levels, which can be useful for assessing just how large differences between groups really are.
    • 4/5ths rule – This one carries more weight as there is legal precedent for using it. However, the rule can result in there appearing to be bias when none exists. The basic premise is explained below.

Effect Size

  • This example plot shows a few interesting things:
    • The size of the difference by graduate stream and gender.
    • Here all the differences are considered ‘small’ by Cohen’s standards, plus there is a ‘balancing out’ across the various streams, as some favour males (those bars with positive values) and others favour females (those with negative values).
  • Exploring group differences like this can help understand if a particular area (or grad stream in this example) might have larger differences and hence warrant further exploration to understand the potential causes of these differences.
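
For readers who want to see the arithmetic, here is a minimal sketch of Cohen’s d using the pooled standard deviation, along with Cohen’s conventional labels (0.2 small, 0.5 medium, 0.8 large). The scores are made up and the function names are only for illustration. As in the plot, a positive value favours the first group passed in.

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardised mean difference (group_a minus group_b) using the pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

def cohen_label(d):
    """Map an effect size onto Cohen's conventional levels."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"

# Made-up scores for one graduate stream.
males = [10, 12, 14, 16, 18]
females = [9, 11, 13, 15, 17]

d = cohens_d(males, females)
print(f"d = {d:.2f} ({cohen_label(d)})")  # positive d favours males here
```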

4/5ths Rule

  • So in the image we can see that 50 Apples (our majority group) made it through this stage in the process. Assuming the same number of Apples and Figs entered the stage (the rule compares pass rates, not raw counts), at least 40 Figs also need to pass to avoid there being bias based on the 4/5ths rule.
  • Being mindful of this rule often results in considering multiple ‘pass marks’. An example of this links to the role fit graphs shown earlier. In this instance a participant’s ‘Overall Percentile Fit’ was broken down into 10 groups, so participants with an overall fit equal to 90 or more were all put in group 10. Participants were then separated based on gender to evaluate the 4/5ths rule at multiple potential ‘pass marks’, as illustrated below.
  • Overall Job Fit divided into 10 groups, then split by Gender.
  • A variety of ‘pass marks’ were evaluated to ensure compliance with the 4/5ths rule and hence avoid Gender bias.
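
The rule itself boils down to one ratio: the pass rate of the group with the lower rate, divided by the pass rate of the group with the higher rate, should be at least 4/5 (0.8). A minimal sketch follows, using the Apples and Figs example. The group sizes of 100 are an assumption made for the illustration, and the function names are hypothetical.

```python
from fractions import Fraction

FOUR_FIFTHS = Fraction(4, 5)

def selection_rate(passed, assessed):
    """Share of a group that passed, kept as an exact fraction."""
    return Fraction(passed, assessed)

def impact_ratio(rate_a, rate_b):
    """Lower selection rate divided by the higher one."""
    high = max(rate_a, rate_b)
    if high == 0:
        return Fraction(1)  # nobody passed in either group, so there is no gap to compare
    return min(rate_a, rate_b) / high

def meets_four_fifths(rate_a, rate_b):
    return impact_ratio(rate_a, rate_b) >= FOUR_FIFTHS

# Assumed for illustration: 100 Apples and 100 Figs were assessed at this stage.
apples = selection_rate(50, 100)
print(meets_four_fifths(apples, selection_rate(40, 100)))  # 40 Figs pass: ratio is exactly 4/5
print(meets_four_fifths(apples, selection_rate(39, 100)))  # 39 Figs pass: ratio falls just short
```

Fractions are used instead of floats so that a ratio of exactly 4/5 is never misjudged because of rounding.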


Adding an Extra Degree of Difficulty

  • Whilst the above post has focused on Gender, it is common that we need to be mindful of groups in addition to Gender.
  • The super cool plot below is a client example where disability status (whether a candidate reported having a disability) was an additional area that needed to be taken into account.
  • The top plots look at selection ratios for Gender and Disability. The red lines on both graphs show that a lower percentage (or ratio) of females or candidates having a disability would be selected at all pass marks. If there was no difference in the selection ratios, the two lines would be on top of one another.
  • The bottom plot then applies these selection ratios with the 4/5ths rule in mind (described earlier). In this instance the client was advised to pick a pass mark (also called a ‘cut score’) that was above the evil looking red line drawn through the middle of the graph. This way there would be no violation of the 4/5ths rule. 
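
The logic behind the bottom plot can be sketched as a scan over candidate pass marks, checking the 4/5ths rule for every group of interest at each one. This is a hedged illustration on made-up data, not the client analysis: the field names (`fit_group`, `gender`, `disability`), the helper `make` and the candidate counts are all assumptions for the example.

```python
from fractions import Fraction

FOUR_FIFTHS = Fraction(4, 5)

def impact_ratio(candidates, attribute, pass_mark):
    """Lowest selection rate divided by the highest, across the values of `attribute`."""
    totals, passed = {}, {}
    for person in candidates:
        value = person[attribute]
        totals[value] = totals.get(value, 0) + 1
        if person["fit_group"] >= pass_mark:
            passed[value] = passed.get(value, 0) + 1
    rates = [Fraction(passed.get(value, 0), n) for value, n in totals.items()]
    if max(rates) == 0:
        return None  # nobody passes at this mark, so the ratio is undefined
    return min(rates) / max(rates)

def compliant_pass_marks(candidates, attributes, pass_marks=range(1, 11)):
    """Pass marks where every attribute satisfies the 4/5ths rule."""
    compliant = []
    for mark in pass_marks:
        ratios = [impact_ratio(candidates, attribute, mark) for attribute in attributes]
        if all(ratio is not None and ratio >= FOUR_FIFTHS for ratio in ratios):
            compliant.append(mark)
    return compliant

def make(fit_group, gender, disability, count):
    """Build `count` made-up candidates sharing the same profile."""
    return [{"fit_group": fit_group, "gender": gender, "disability": disability}] * count

# Made-up candidates, spread over fit groups 2, 5 and 8.
candidates = (
    make(2, "M", False, 10) + make(5, "M", False, 10) + make(8, "M", False, 10)
    + make(2, "F", False, 10) + make(5, "F", False, 12) + make(8, "F", False, 8)
    + make(2, "M", True, 4) + make(5, "M", True, 4) + make(8, "M", True, 2)
)

print(compliant_pass_marks(candidates, ["gender", "disability"]))
```

In this made-up data the higher pass marks satisfy the rule for Gender but not for Disability, which shows why each group needs checking at every pass mark rather than only the headline one.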


That’s all for now Folks!