Understanding heuristic-based outlier detection #611

Open
clevilll opened this issue Nov 27, 2024 · 0 comments

I'm struggling to understand the mathematics of, and relate to, a new cross-density-distance outlier detection algorithm recently published in the Computers & Security journal. I have already created an issue in the PyOD repo and mentioned the general approach in this recent paper for further consideration of whether it is a valid approach to compare with the other algorithms in the PyOD Python package.

Background:

I'm familiar with the known families of unsupervised anomaly/outlier detection:

  • density-based
  • distance-based
  • autoencoder
  • model-based
  • cross-density-distance

based on this post, which links to the following:

There is no such algorithm "which works in most cases". The task heavily depends on the specifics of your case, e.g. whether you need local anomalies (a point differs from other points near it) or global ones (a point does not look similar to any other point in the dataset). A very good review of anomaly detection algorithms can be found here:

"A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data" by Markus Goldstein, Seiichi Uchida

  • I'm relatively familiar with the general workflow of unsupervised OD algorithms, which ends with scoring outlierness (the so-called "outlier score"), assuming an anomaly or contamination rate of 5% to 10% that is set via a threshold-related (hyper-)parameter, e.g. contamination=0.05.

    Based on my findings, I really liked the procedure suggested in the following book, whose three-step modeling procedure for anomaly detection involves:

    1. model development
    2. threshold determination
    3. feature evaluation

    "Handbook of Anomaly Detection: With Python Outlier Detection examples" by Chris Kuo, published in Dataman in AI

    There, the author generates mock data with a given contamination rate and splits it into a train set and a test set (a minimal sketch of this workflow follows this list).

  • I'm also aware of geometric approaches used in dimensionality-reduction algorithms such as LDA and PCA, which project the data to find the best fit for separating observations of different classes.
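
As a concrete (and simplified) illustration of the three-step procedure above, here is a minimal sketch assuming PyOD's ECOD as the detector; the mock data, the split, and the crude feature check are my assumptions, not the book's exact code:

```python
# Minimal sketch of the three-step procedure (my reading, not the book's code):
# 1) model development, 2) threshold determination from an assumed
# contamination rate, 3) a crude feature evaluation.
import numpy as np
from sklearn.model_selection import train_test_split
from pyod.models.ecod import ECOD

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, (950, 3)),  # normal points
    rng.normal(6, 1, (50, 3)),   # ~5% contamination
])
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# 1) model development
model = ECOD(contamination=0.05)
model.fit(X_train)

# 2) threshold determination: contamination=0.05 places the threshold at the
# 95th percentile of the training scores (PyOD exposes it as threshold_)
print("threshold:", model.threshold_)
test_scores = model.decision_function(X_test)
pred = (test_scores > model.threshold_).astype(int)  # 1 = flagged as outlier
print("flagged outliers in the test set:", pred.sum())

# 3) feature evaluation (crude): compare feature means of flagged vs. the rest
print("outlier feature means:", X_test[pred == 1].mean(axis=0))
print("inlier feature means: ", X_test[pred == 0].mean(axis=0))
```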

My questions/concerns:

I have seen, for example, how ECOD or HBOS works, but the approach used in the discussed journal paper and its math are vague and unclear to me. Since I already explained the details in the linked issue, I won't repeat them here; I will just list my questions:

  • Q1. Does it make sense to calculate the outlier score, apply a confidence score on top of it, and then start allocating weights? Even if we accept this logic with eyes closed, how can it be justified in the cubic form of equation (18) in the paper (a sketch of my reading of this equation follows the questions below):

Compute the final ensemble score $O_i$ for data point $X_i$ passed through all submodels
$$E(O, W)=\sum_{i=1}^n\left(s_i \cdot s_{c i} \cdot \omega_i\right)^3$$

Throughout our experiment, we set the density score weight ($\omega_{density}$) to 0.8 and the distance score weight ($\omega_{distance}$) to 0.2.
It's not clear how the authors arrive, even empirically, at these coefficients for weighting and mixing the scores the way the user prefers, or for getting better accuracy if one has labels!

The impact of this approach is shown in Fig. 4 of the paper, comparing density-based and distance-based scoring with the empirical coefficients denoted as weights in equation (15), i.e. the impact of human assistance!

  • Q2. Generally speaking, doesn't modifying the scoring with a human-chosen factor amount to manipulating the scores to get the best results, when the real art is finding non-parametric detection that minimizes human parameterization?

Maybe I'm not fully informed; as an enthusiast, I want to understand the reliability and validity of this promising detector. However, if we are heading towards unsupervised OD/AD, shouldn't we head towards non-parametric algorithms? Isn't setting the proposed parameters $\omega_{distance}$ and $\omega_{density}$ by a human factor against the spirit of unsupervised algorithms, even when the idea is sold as a human-assisted model?

  • Q3. What logic (XAI or explainability) could justify that the detection scoring should have $\omega_{density} > \omega_{distance}$, i.e. that we intentionally force $\omega_{density}$ to dominate $\omega_{distance}$? Can one explain, based on empirical experiments with contextualized features, why this trick should be used specifically over imbalanced data, where there is no guarantee that the features follow a Gaussian distribution?
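
To make Q1 and Q3 concrete, here is a minimal sketch of how I read equation (18); all variable names and toy numbers are my assumptions, and the paper may normalize the scores differently:

```python
# Toy sketch of my reading of equation (18): E(O, W) = sum_i (s_i * s_ci * w_i)^3.
# All names and numbers here are my assumptions, not the paper's code.
import numpy as np

# Per-point outlier scores from the two submodels (toy values in [0, 1]):
s_density  = np.array([0.10, 0.20, 0.95])  # density-based submodel
s_distance = np.array([0.15, 0.25, 0.90])  # distance-based submodel

# Per-point confidence scores s_ci (toy values):
conf = np.array([0.90, 0.80, 0.95])

# The paper's fixed, human-chosen submodel weights (the subject of Q2/Q3):
w_density, w_distance = 0.8, 0.2

# One plausible reading: each submodel term (score * confidence * weight)
# is cubed, then everything is summed into the ensemble score.
E = np.sum((s_density * conf * w_density) ** 3
           + (s_distance * conf * w_distance) ** 3)
print("ensemble score E(O, W):", E)

# For equal scores and confidences, cubing turns the 0.8/0.2 = 4x weight
# ratio into a 4**3 = 64x ratio between the density and distance terms.
```

Under this reading, the cube makes the hand-chosen weights far more dominant than they appear, which is exactly what Q3 is asking about.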

This approach sounds like ad-hockery in outlier detection, as normally practiced by non-statisticians... ref

How to test unsupervised learning methods for anomaly detection?


Side notes:

  • Please see this post to understand the "difference between Outlier and Anomaly in the context of machine learning"
  • Please see these posts to understand OD/AD in the domain of IT and network/cyber security: post1 post2 post3 post4
  • Please see this post to understand why "...anomaly detection should not be considered as classification problem" or binary classification
  • Please see these posts to understand that in OD/AD, overfitting is not a solution: post1 post2
  • Please see this post to understand the data split for OD/AD models: post1
  • Please see this post to understand how to select the threshold for unsupervised anomaly detection: post1
  • Please see these posts to understand OD/AD over imbalanced, heavy-tailed, or skewed data/distributions: post1 post2