Analysis of the Neighborhood Parameter on Outlier Detection Algorithms - Evaluation Tests - Dataset

Dataset

Analysis of the Neighborhood Parameter on Outlier Detection Algorithms - Evaluation Tests

Evaluation tests conducted for the paper "Impact of the Neighborhood Parameter on Outlier Detection Algorithms" by F. Iglesias, C. Martínez, and T. Zseby.

Context and methodology

Many anomaly detection algorithms base their distance and density estimates on a neighborhood parameter (usually referred to as k). The experiments in this repository analyze how five state-of-the-art algorithms (kNN, LOF, LoOP, ABOD, and SDO) are affected by variations in k in combination with different alterations that the data may undergo in relation to: cardinality, dimensionality, global outlier ratio, local outlier ratio, layers of density, inlier-outlier density ratio, and zonification. Evaluations are conducted with accuracy measurements (ROC-AUC, adjusted Average Precision, and Precision at n) and runtimes.
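For orientation, the three accuracy measurements named above can be sketched in plain Python. This is only an illustrative sketch and not the code in indices.py: the function names are hypothetical, labels are assumed to be 1 for outliers and 0 for inliers, higher scores are assumed to mean "more outlying", and the chance adjustment (index minus expected value, divided by one minus expected value, with the outlier ratio as the expected value) is a commonly used convention assumed here rather than taken from the repository.

```python
def roc_auc(labels, scores):
    """Probability that a random outlier outscores a random inlier (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def precision_at_n(labels, scores, n=None):
    """Fraction of true outliers among the n top-scored points.

    By default n is the number of true outliers (an assumption of this sketch).
    """
    n = sum(labels) if n is None else n
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    return sum(y for _, y in ranked[:n]) / n


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each true outlier."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


def adjusted_average_precision(labels, scores):
    """Chance-adjusted AP: 0 for a random ranking on average, 1 for a perfect one."""
    expected = sum(labels) / len(labels)  # outlier ratio
    return (average_precision(labels, scores) - expected) / (1 - expected)


# Made-up toy example: 2 outliers among 6 points.
labels = [0, 0, 1, 0, 1, 0]
scores = [0.1, 0.4, 0.9, 0.2, 0.3, 0.05]
print(roc_auc(labels, scores))
print(precision_at_n(labels, scores))
print(adjusted_average_precision(labels, scores))
```

The adjustment matters when comparing datasets with different outlier ratios, since the unadjusted Average Precision of a random ranking grows with the proportion of outliers.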

This repository belongs to research in the following domains: algorithm evaluation, outlier detection, anomaly detection, unsupervised learning, machine learning, data mining, and data analysis. Datasets and algorithms can be used for experiment replication and for further evaluation and comparison.

Technical details

Experiments are implemented in Python 3 (tested with v3.9.6). The provided scripts generate all data and results, which are also kept in the repository for the sake of comparability and replicability. The file and folder structure is as follows:

results_datasets_scores.zip contains all results and plots as shown in the paper, as well as the generated datasets and files with anomaly scores.

dependencies.sh installs the required Python packages in a clean environment.

generate_data.py creates experimental datasets.

outdet.py runs outlier detection with ABOD, kNN, LOF, LoOP and SDO over the collection of datasets.

indices.py contains functions implementing accuracy indices.

explore_results.py parses results obtained with outlier detection algorithms to create comparison plots and a table with optimal ks.

test_kfc.py runs KFC tests for finding the optimal k in a collection of datasets. It requires kfc.py, which is not included in this repo and must be downloaded from https://github.com/TimeIsAFriend/KFC. kfc.py implements the KFCS and KFCR methods for finding the optimal k as presented in [1].

explore_kfc.py parses results obtained with KFCS and KFCR methods to create LaTeX tables.

README.md provides explanations and step-by-step instructions for replication.
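To give an intuition of why k matters for the detectors listed above, the following minimal sketch scores points by the distance to their k-th nearest neighbor, which is one standard definition of the kNN outlier score. It is not taken from outdet.py, whose implementation and options may differ; the function name knn_scores and the toy data are made up for illustration.

```python
import math


def knn_scores(points, k):
    """Outlier score of each point: Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Made-up toy data: a dense 3x3 grid of inliers plus two outliers lying close together.
inliers = [(x * 0.1, y * 0.1) for x in range(3) for y in range(3)]
pair = [(2.0, 2.0), (2.05, 2.0)]
points = inliers + pair

for k in (1, 2, 3):
    scores = knn_scores(points, k)
    inlier_scores = scores[:len(inliers)]
    pair_scores = scores[len(inliers):]
    # True when both outliers score higher than every inlier.
    print(k, min(pair_scores) > max(inlier_scores))
```

With k = 1 the two outliers are each other's nearest neighbor and therefore receive lower scores than the inliers; from k = 2 onward their k-th neighbor lies in the distant grid and they are ranked on top. Clustered outliers of this kind are one example of how data characteristics interact with the choice of k.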

References

[1] Jiawei Yang, Xu Tan, Sylwan Rahardja, Outlier detection: How to Select k for k-nearest-neighbors-based outlier detectors, Pattern Recognition Letters, Volume 174, 2023, Pages 112-117, ISSN 0167-8655, https://doi.org/10.1016/j.patrec.2023.08.020.

License

The CC-BY license applies to all data generated with the "generate_data.py" script. All distributed code is under the GNU GPL license.

Identifier
DOI	https://doi.org/10.48436/xvy1m-jwg83
Related Identifier	Cites https://doi.org/10.1016/j.patrec.2023.08.020
Related Identifier	IsSupplementTo https://doi.org/10.1007/978-3-031-75823-2_8
Related Identifier	IsVersionOf https://doi.org/10.48436/kfrat-rpn74
Metadata Access	https://researchdata.tuwien.ac.at/oai2d?verb=GetRecord&metadataPrefix=oai_datacite&identifier=oai:researchdata.tuwien.ac.at:xvy1m-jwg83

Provenance
Creator	Iglesias Vazquez, Felix (ORCID: 0000-0001-6081-969X)
Publisher	TU Wien
Publication Year	2024
Rights	Creative Commons Attribution 4.0 International; GNU General Public License v3.0 or later; https://creativecommons.org/licenses/by/4.0/legalcode; https://www.gnu.org/licenses/gpl-3.0-standalone.html
OpenAccess	true
Contact	tudata(at)tuwien.ac.at

Representation
Resource Type	Software
Version	1.0.0
Discipline	Other