Stats Under the Stars 8

Ends on 16.06.2025

The challenge centers on money laundering: the illegal process of making money obtained from criminal activities appear to have been earned legally.

This is typically achieved by passing the illicit funds through legitimate financial institutions and channels to disguise their true origin.
Anti-Money Laundering (AML) is a central issue in modern banking and finance, as institutions must comply with strict regulations and develop systems to detect and prevent these illicit activities.

Participants will work on a realistic dataset simulating transactional banking data and will be expected to apply machine learning and analytical techniques to identify suspicious patterns.

Participants are expected to develop a solution that addresses the following problem:

Problem: A dataset containing transaction information from a bank is provided. Among thousands of legitimate transactions, a small proportion is suspected to be related to money laundering activities.

Goal: Implement a binary classification model capable of distinguishing between regular and laundering transactions.

Bibliography

The following references provide foundational works and key studies
related to AML, financial fraud detection, and machine learning techniques
applied to transaction analysis.

• N. E., Ahmad. Anti-money Laundering using Graph Techniques. Doctoral Thesis, University of Porto, 2024.

• B. Dumitrescu, A. Băltoiu, S. Budulan. Anomaly Detection in Graphs of Bank Transactions for Anti Money Laundering Applications. IEEE Access, 2021.

• R. Karim, F. Hermsen, S.A. Chala, P. De Perthuis, A. Mandal. Scalable Semi-Supervised Graph Learning Techniques for Anti Money Laundering. Proceedings of the International Conference on Machine Learning (ICML), 2022.

34 participants, 772 submissions

Given the low percentage of laundering transactions, a special evaluation metric is required. It combines three components:

AUC (Area Under the Curve): measures the model’s ability to distinguish between fraudulent and non-fraudulent transactions.

Balanced Accuracy: ensures the model performs well on both classes, especially when the dataset is imbalanced. It is defined as the average of the True Positive Rate and True Negative Rate:

$$\text{Balanced Accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$

where TP are frauds correctly classified, FN are frauds missed, TN are legitimate transactions correctly classified, and FP are legitimate transactions misclassified as fraud.
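To see why Balanced Accuracy (the average of the True Positive Rate and True Negative Rate) matters on imbalanced data, consider a worked example. The numbers are invented for illustration and are not taken from the challenge dataset: a test set of 10,000 transactions with 100 frauds, and a model that labels every transaction as legitimate, so that TP = 0, FN = 100, TN = 9,900 and FP = 0.

```latex
\begin{align*}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}
                 = \frac{0 + 9900}{10000} = 0.99 \\
\text{Balanced Accuracy} &= \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)
                 = \frac{1}{2}\left(\frac{0}{100} + \frac{9900}{9900}\right) = 0.5
\end{align*}
```

Plain accuracy rewards this useless model with 0.99, while Balanced Accuracy drops to 0.5, the same value a random guess would obtain.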

Fraud Capture Rate (Top N Predictions): calculates the proportion of actual fraudulent transactions found in the top N = 485 predictions with the highest fraud probability. It emphasizes prioritizing the most likely fraudulent transactions for manual review. The formulation is

$$\text{Fraud Capture Rate} = \frac{\sum_{i \in T_{485}} y_i}{\sum_{i=1}^{T} y_i}$$

where N = 485 is the number of transactions selected for review (i.e., those with the 485 highest predicted probabilities); $T_{485}$ is the set of indices corresponding to the top-485 predictions; $y_i$ is the true label of transaction i, where $y_i = 1$ indicates a fraud; and $\sum_{i=1}^{T} y_i$ is the total number of fraudulent transactions in the entire test set (of size T).

These metrics ensure that the model not only detects fraud effectively but also focuses on the transactions that are most critical for investigation, given real-world constraints in AML systems.

The final score will be computed as the arithmetic mean of these three metrics.
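For local validation, the three components and their mean can be approximated with a short pure-Python sketch. This is not the platform's scoring code. It assumes that AUC means the area under the ROC curve, that ties among the top-N scores are broken by input order, and that both classes are present in the labels. All function names are hypothetical.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """ROC AUC via the rank-sum (Mann-Whitney) formula, averaging tied ranks."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i])
    ranks = [0.0] * len(y_score)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with the one at position i.
        while j + 1 < len(order) and y_score[order[j + 1]] == y_score[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Average of the True Positive Rate and the True Negative Rate."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def fraud_capture_rate(
    y_true: Sequence[int], y_score: Sequence[float], n: int = 485
) -> float:
    """Share of all frauds that fall among the n highest-scored transactions."""
    top = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)[:n]
    return sum(y_true[i] for i in top) / sum(y_true)


def final_score(
    y_true: Sequence[int],
    y_score: Sequence[float],
    y_pred: Sequence[int],
    n: int = 485,
) -> float:
    """Arithmetic mean of AUC, Balanced Accuracy and Fraud Capture Rate."""
    return (
        roc_auc(y_true, y_score)
        + balanced_accuracy(y_true, y_pred)
        + fraud_capture_rate(y_true, y_score, n)
    ) / 3


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    preds = [0, 0, 0, 1]
    print(final_score(labels, scores, preds, n=2))
```

AUC and the Fraud Capture Rate depend only on the ordering of the predicted probabilities, while Balanced Accuracy depends only on the 0/1 labels. The decision threshold can therefore be tuned on a validation split without affecting the other two components.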

• There is no limit to the number of submissions that each participant can make.

• At the end of the competition, only the last submission will be considered for the calculation of the final score, so make sure your last submission is the one you consider best. REMARK: don’t forget to upload the report with the last submission!

• The final score (based on 2/3 of the data) is calculated using only predictions that were NOT used in the calculation of the partial score (based on 1/3 of the data).

• In the event of a tie in the final score, the user who made their last submission first prevails. Note that the results displayed by the platform are rounded to the third decimal place; we will certify any score differences not displayed by the system.

The submission file is a .txt file containing two columns (without header):

• prob(Fraud) – The predicted probability that the transaction is fraudulent (a float between 0 and 1).

• prediction – The final classification as either 0 or 1.

Important: order matters! The predictions must be in the same order as the corresponding samples in the provided test set.
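A submission file in this two-column layout could be produced with a sketch like the one below. The column separator is not specified in the description, so the comma used here is an assumption to check against the platform's sample file. The 0.5 threshold, the six-decimal formatting, the file name and both function names are also illustrative choices.

```python
from typing import List, Sequence


def format_submission(
    probs: Sequence[float], threshold: float = 0.5, sep: str = ","
) -> List[str]:
    """Build one 'prob<sep>prediction' line per test sample, in input order."""
    lines = []
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        label = 1 if p >= threshold else 0  # assumed decision rule
        lines.append(f"{p:.6f}{sep}{label}")
    return lines


def write_submission(
    probs: Sequence[float],
    path: str = "submission.txt",
    threshold: float = 0.5,
    sep: str = ",",
) -> None:
    """Write the lines without a header, keeping the test-set order."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(format_submission(probs, threshold, sep)) + "\n")


if __name__ == "__main__":
    # Probabilities must already be in the same order as the test set rows.
    write_submission([0.91, 0.02, 0.47])
```

Because laundering transactions are rare, a threshold other than 0.5 may give better Balanced Accuracy. Any such value would need to be chosen on held-out data.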

To access the data, you must log in or register on the platform www.datachallenge.it, then register for the competition and accept its terms and conditions.