Analysis of the Neighborhood Parameter on Outlier Detection Algorithms - Evaluation Tests
conducted for the paper: Impact of the Neighborhood Parameter on Outlier Detection Algorithms by F. Iglesias, C. Martínez, T. Zseby
Context and methodology
A significant number of anomaly detection algorithms base their distance and density estimates on a neighborhood parameter (usually referred to as k). The experiments in this repository analyze how five state-of-the-art algorithms (kNN, LOF, LoOP, ABOD and SDO) are affected by variations in k in combination with different alterations that the data may undergo in relation to: cardinality, dimensionality, global outlier ratio, local outlier ratio, layers of density, inliers-outliers density ratio, and zonification. Evaluations are conducted with accuracy measurements (ROC-AUC, adjusted Average Precision, and Precision at n) and runtimes.
This repository is framed within the research on the following domains: algorithm evaluation, outlier detection, anomaly detection, unsupervised learning, machine learning, data mining, data analysis. Datasets and algorithms can be used for experiment replication and for further evaluation and comparison.
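As a rough illustration of the kind of dependence on k studied here, the sketch below scores points by their distance to the k-th nearest neighbor and reports ROC-AUC for several values of k. It is not taken from the repository scripts: the function names `knn_scores` and `roc_auc`, the toy data, and the chosen values of k are all assumptions made for this example, and only NumPy is required.

```python
import numpy as np

def knn_scores(X, k):
    """Outlier score = Euclidean distance to the k-th nearest neighbor (self excluded)."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    dist_sorted = np.sort(dist, axis=1)
    # Column 0 holds each point's distance to itself (0.0), so column k is the k-th neighbor.
    return dist_sorted[:, k]

def roc_auc(labels, scores):
    """ROC-AUC as the probability that an outlier outscores an inlier (ties count 0.5)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels]
    neg = scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical toy data: one Gaussian cluster of inliers plus uniformly scattered outliers.
rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(200, 2))
outliers = rng.uniform(-6.0, 6.0, size=(10, 2))
X = np.vstack([inliers, outliers])
y = np.r_[np.zeros(200), np.ones(10)]  # 1 marks an outlier

for k in (1, 5, 20, 50):
    print(k, round(roc_auc(y, knn_scores(X, k)), 3))
```

The brute-force distance matrix keeps the example short but needs memory quadratic in the number of points, so it is only suitable for small datasets.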
Technical details
Experiments are written in Python 3 (tested with v3.9.6). The provided scripts generate all data and results; the generated data and results are nevertheless kept in the repo for the sake of comparability and replicability. The file and folder structure is as follows:
results_datasets_scores.zip contains all results and plots as shown in the paper, as well as the generated datasets and files with anomaly scores.
dependencies.sh installs the required Python packages in a clean environment.
generate_data.py creates experimental datasets.
outdet.py runs outlier detection with ABOD, kNN, LOF, LoOP and SDO over the collection of datasets.
indices.py contains functions implementing accuracy indices.
explore_results.py parses results obtained with outlier detection algorithms to create comparison plots and a table with optimal ks.
test_kfc.py runs KFC tests for finding the optimal k in a collection of datasets. It requires kfc.py, which is not included in this repo and must be downloaded from https://github.com/TimeIsAFriend/KFC. kfc.py implements the KFCS and KFCR methods for finding the optimal k as presented in [1].
explore_kfc.py parses results obtained with KFCS and KFCR methods to create LaTeX tables.
README.md provides explanations and step-by-step instructions for replication.
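For readers unfamiliar with the accuracy indices mentioned above, the following is a minimal sketch of Precision at n and Average Precision together with a chance adjustment of the form (value - r) / (1 - r), where r is the outlier ratio. This is an assumption about how such indices are commonly defined in the outlier detection literature; it is not the code in indices.py, whose function names and exact definitions may differ. All names below are hypothetical.

```python
import numpy as np

def precision_at_n(labels, scores, n=None):
    """Fraction of true outliers among the n top-scored points (n defaults to the outlier count)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    if n is None:
        n = int(labels.sum())
    top = np.argsort(-scores, kind="stable")[:n]
    return labels[top].sum() / n

def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each true outlier."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")
    hits = labels[order]
    ranks = np.arange(1, len(hits) + 1)
    precisions = np.cumsum(hits) / ranks
    return precisions[hits].mean()

def adjust_for_chance(value, labels):
    """Rescale so that a random ranking scores about 0 and a perfect ranking scores 1."""
    ratio = np.asarray(labels, dtype=bool).mean()  # outlier ratio r
    return (value - ratio) / (1.0 - ratio)

# Hypothetical example: 2 outliers among 10 points, higher score = more anomalous.
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]

p_at_n = precision_at_n(labels, scores)
ap = average_precision(labels, scores)
print("P@n:", p_at_n, "adjusted:", adjust_for_chance(p_at_n, labels))
print("AP:", ap, "adjusted:", adjust_for_chance(ap, labels))
```

The adjustment makes values comparable across datasets with different outlier ratios, which matters when the global outlier ratio is itself one of the varied factors.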
References
[1] Jiawei Yang, Xu Tan, Sylwan Rahardja, Outlier detection: How to Select k for k-nearest-neighbors-based outlier detectors, Pattern Recognition Letters, Volume 174, 2023, Pages 112-117, ISSN 0167-8655, https://doi.org/10.1016/j.patrec.2023.08.020.
License
The CC-BY license applies to all data generated with the "generate_data.py" script. All distributed code is under the GNU GPL license.