Project Review: Detecting Mule-accounts with ML, in batch mode


Problem setting

The goal is to detect which accounts in a digital bank are actually mule accounts, i.e. accounts whose purpose is to facilitate crime (for example, moving stolen funds). Every bank needs to detect and remove these accounts, or it may face regulatory fines and reputational damage.

Technical overview

The solution is composed of two parts:

  • A batch model, which assigns a risk score to every customer;

  • A real-time decision layer, which checks the customer's most recent risk score whenever certain events happen. If the score exceeds a threshold, the customer's account is flagged and sent for human analysis to decide whether to close it.
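The decision layer can be sketched in a few lines. This is a hypothetical illustration, not the project's actual code: the names (`RISK_THRESHOLD`, `latest_score`, `on_event`) and the threshold value are assumptions.

```python
# Hypothetical sketch of the real-time decision layer.
# All names and values here are illustrative assumptions.
RISK_THRESHOLD = 0.8  # manually tuned, see the capacity discussion below

def latest_score(scores_by_customer: dict, customer_id: str) -> float:
    """Look up the most recent batch risk score for a customer."""
    return scores_by_customer.get(customer_id, 0.0)

def on_event(scores_by_customer: dict, customer_id: str) -> bool:
    """Called when a triggering event happens; True means flag for review."""
    return latest_score(scores_by_customer, customer_id) > RISK_THRESHOLD

# Usage: customer "cust-1" is above the threshold and gets flagged.
scores = {"cust-1": 0.93, "cust-2": 0.41}
print(on_event(scores, "cust-1"))  # True
print(on_event(scores, "cust-2"))  # False
```

The key property is that the layer itself does no scoring: it only reads the latest batch score, which keeps the event-time path cheap and fast.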

The training dataset was composed of one row per customer, per day. The target was 1 if that account was closed as a confirmed mule account within X days of the scoring date, and 0 otherwise.
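The label construction above can be sketched with pandas. The column names, the toy data, and the concrete horizon value are assumptions for illustration, not the project's actual schema.

```python
import pandas as pd

# Illustrative sketch of building the target. Column names and the
# horizon value stand in for the unspecified "X days" in the text.
X_DAYS = 30

# One row per customer per day (toy data).
snapshots = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "scoring_date": pd.to_datetime(["2023-01-01", "2023-02-01", "2023-01-01"]),
})
# Confirmed mule-account closures (toy data).
closures = pd.DataFrame({
    "customer_id": ["a"],
    "closed_date": pd.to_datetime(["2023-01-20"]),
})

df = snapshots.merge(closures, on="customer_id", how="left")
horizon = df["scoring_date"] + pd.Timedelta(days=X_DAYS)
df["target"] = (
    df["closed_date"].notna()
    & (df["closed_date"] > df["scoring_date"])
    & (df["closed_date"] <= horizon)
).astype(int)
# Customer "a" scored on 2023-01-01 is positive (closed 19 days later);
# the 2023-02-01 snapshot and customer "b" are negative.
```

Note that the same customer appears with different labels on different scoring dates, which is exactly what "one row per customer, per day" implies.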

Challenges

  • Human analysis training bias

The target carries an implicit bias: only accounts that were flagged by earlier detection systems ever had a chance of being analysed by the operations team.

In effect, the model is learning how the operations team works.

  • Early detection vs actionability:

Ideally a bank wants to detect a mule account as early as possible (the longer a mule account is active, the more damage it can do).

But if one trains a model with a target too far into the future, it is hard for human analysts to decide with confidence whether a flagged account is a true or false positive. Shorter targets mean fewer positive examples to train with, but the signal may be stronger and help the operations team gather enough evidence to act on it.
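The trade-off above can be made concrete with a toy calculation. The closure delays below are invented numbers purely to show the mechanics: a longer horizon X converts more eventual closures into positive labels.

```python
# Toy illustration of the horizon trade-off. The delays (days between
# scoring date and confirmed closure) are invented; None = never closed.
closure_delays = [5, 12, 25, 40, 70, None, None, None]

def positives(horizon_days: int) -> int:
    """Count accounts that would be labelled 1 under a given horizon X."""
    return sum(1 for d in closure_delays if d is not None and d <= horizon_days)

for x in (15, 30, 60, 90):
    print(f"X={x:>2} days -> {positives(x)} positive labels")
# X=15 -> 2, X=30 -> 3, X=60 -> 4, X=90 -> 5
```

Widening the horizon buys more positives, but the extra positives are precisely the cases where evidence at scoring time was weakest, which is what makes them hard for analysts to confirm.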

  • Explainable scores:

This use-case had a strong explainability requirement, as it would trigger a manual analysis by a human. Understanding which features contributed the most to a high risk score was very important.
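One simple way to get per-prediction explanations, sketched below, is a linear model, where each feature's contribution to the log-odds is just coefficient times feature value. This is a hedged illustration only: the project's actual model and explainability tooling are not specified, the feature names are invented, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: 3 invented features, with the first two driving the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

feature_names = ["txn_velocity", "new_payees", "login_country_changes"]  # made up
model = LogisticRegression().fit(X, y)

# For a linear model, the log-odds decompose additively per feature,
# so each feature's contribution to one customer's score is coef * value.
x = X[0]
contributions = model.coef_[0] * x
for name, c in sorted(zip(feature_names, contributions), key=lambda t: -abs(t[1])):
    print(f"{name:>22}: {c:+.3f}")
```

For non-linear models such as gradient-boosted trees, additive per-prediction attributions of the same shape are commonly obtained with SHAP values instead.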

  • Operations capacity issues:

The operations team (which needs to manually check every account flagged by the model) is a finite resource. One cannot flag too many accounts, or the operations team will be overwhelmed.

The threshold used in the decision layer had to be adjusted manually from time to time to account for peaks in flagging volume and other factors affecting the operations team's availability (extra headcount, team attrition, etc.).
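One way the manual threshold adjustment could be framed is capacity-first: instead of fixing a score cutoff, fix the number of accounts the team can review per day and derive the cutoff as a score quantile. This is a sketch of that idea, not the project's mechanism; the function name, capacity figure, and score distribution are assumptions.

```python
import numpy as np

def capacity_threshold(scores: np.ndarray, daily_capacity: int) -> float:
    """Return a score cutoff that flags roughly `daily_capacity` accounts."""
    if daily_capacity >= len(scores):
        return 0.0  # team can review everything; flag all
    return float(np.quantile(scores, 1 - daily_capacity / len(scores)))

# Synthetic batch of daily scores, skewed toward low risk.
rng = np.random.default_rng(42)
todays_scores = rng.beta(2, 8, size=10_000)

thr = capacity_threshold(todays_scores, daily_capacity=50)
flagged = int((todays_scores > thr).sum())
print(f"threshold={thr:.3f}, flagged={flagged}")
```

Under this framing, headcount changes or attrition translate directly into a new `daily_capacity` value, and the threshold follows automatically instead of being re-tuned by hand.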
