I'm struggling to understand the mathematics of a new cross-density-distance outlier detection algorithm recently published in the Computers & Security journal. I have already created an issue in the PyOD repo describing the paper's general approach, asking whether it is valid to compare this approach with the other algorithms in the [tag:PyOD] Python package.
There is no single algorithm that "works in most cases". The task depends heavily on the specifics of your problem, e.g. whether you need local anomalies (a point that differs from the points near it) or global ones (a point that does not resemble any other point in the dataset). A very good review of anomaly detection algorithms can be found here:
"A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data" by Markus Goldstein, Seiichi Uchida
I'm relatively familiar with the general workflow of unsupervised OD algorithms, which ends with scoring outlierness (the so-called "outlier score"). An assumed anomaly rate, or contamination rate, of 5% to 10% is then used to set the threshold via the corresponding (hyper-)parameter, e.g. contamination=0.05.
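The contamination-based thresholding described above can be sketched in a few lines. This is a generic illustration, not tied to any particular detector; `scores` here are just random stand-ins for a model's outlier scores:

```python
import numpy as np

# Hypothetical outlier scores from some detector (higher = more anomalous).
rng = np.random.default_rng(0)
scores = rng.normal(size=1000)

# With contamination=0.05, the threshold is the (1 - 0.05) quantile of the
# training scores; the top 5% of points are flagged as outliers.
contamination = 0.05
threshold = np.quantile(scores, 1 - contamination)
labels = (scores > threshold).astype(int)  # 1 = outlier, 0 = inlier

print(labels.mean())  # roughly equal to the contamination rate
```

This is exactly why the contamination rate is a human-chosen prior rather than something the data reveals on its own.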
Based on my reading, I really liked the modeling procedure for anomaly detection suggested in the following book, which involves three steps:
model development
threshold determination
feature evaluation
"Handbook of Anomaly Detection: With Python Outlier Detection examples" by Chris Kuo, published in Dataman in AI
The authors then generate mock data with a given contamination rate and split it into a train set and a test set.
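The three-step procedure above (model development, threshold determination, evaluation) can be sketched with mock data. The toy k-nearest-neighbour distance score used as the "model" here is only an illustrative stand-in, not the book's or the paper's actual detector:

```python
import numpy as np

rng = np.random.default_rng(42)

# Mock data: 95% inliers around the origin, 5% scattered far-away outliers.
n, contamination = 400, 0.05
n_out = int(n * contamination)
inliers = rng.normal(0.0, 1.0, size=(n - n_out, 2))
outliers = rng.choice([-1.0, 1.0], size=(n_out, 2)) * rng.uniform(5.0, 9.0, size=(n_out, 2))
X = np.vstack([inliers, outliers])
y = np.r_[np.zeros(n - n_out), np.ones(n_out)]

# Shuffled train/test split.
idx = rng.permutation(n)
split = int(0.7 * n)
train, test = idx[:split], idx[split:]

# Step 1: model development -- score = distance to the k-th nearest
# training point (for train-on-train, index 0 is the point itself).
def knn_score(X_train, X_query, k=5):
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k]

train_scores = knn_score(X[train], X[train])

# Step 2: threshold determination from the assumed contamination rate.
threshold = np.quantile(train_scores, 1 - contamination)

# Step 3: evaluation on the held-out test set.
test_scores = knn_score(X[train], X[test])
pred = (test_scores > threshold).astype(int)
accuracy = (pred == y[test]).mean()
```

Note that evaluating step 3 with accuracy is only possible here because the mock data carries ground-truth labels; with real unlabeled data this step is much harder.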
I'm also aware of the geometric approaches used in dimension-reduction algorithms such as LDA (which projects data so as to best separate observations of different classes) and PCA (which projects onto the directions of maximal variance).
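As a small illustration of the projection idea, here is a PCA-based outlier score built from reconstruction error. This is a common textbook construction, not taken from the paper: points far from the principal subspace reconstruct poorly and therefore score high.

```python
import numpy as np

rng = np.random.default_rng(7)
# Data lying mostly along one direction, plus a few off-subspace points.
t = rng.normal(size=(300, 1))
X = np.hstack([t, 0.1 * rng.normal(size=(300, 1)) + t])
X[:5] += np.array([3.0, -3.0])   # inject 5 off-subspace outliers

# PCA via SVD: project onto the first principal component and measure
# how badly each point is reconstructed from that projection.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
proj = Xc @ Vt[:1].T @ Vt[:1]          # rank-1 reconstruction
recon_error = np.linalg.norm(Xc - proj, axis=1)

# Points far from the principal subspace get the largest errors.
top5 = np.argsort(recon_error)[-5:]
```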
My Question/Concern:
I have seen, for example, how ECOD or HBOS work, but the approach used in the journal paper in question, and its math, is vague and unclear. Since I already explained this in the issue, I will skip the details and just list my questions:
Q1. Does it make sense to calculate the outlier score, apply the confidence score, and then, on top of that, allocate weights? Even accepting this logic at face value, how is the cubic form justified, based on equation (18) in the paper:
Compute the final ensemble score $O_i$ for data point $X_i$ passed through all submodels $$E(O, W)=\sum_{i=1}^n\left(s_i \cdot s_{c i} \cdot \omega_i\right)^3$$
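Taking the quoted equation at face value, a toy computation shows what the cubic form does: each submodel's product of score, confidence, and weight is cubed before summing, so submodels with large weighted scores dominate the ensemble while small ones are suppressed. All numbers below are hypothetical:

```python
import numpy as np

# Toy per-submodel values: outlier scores s_i, confidence scores s_ci,
# and weights w_i (all made up, just to exercise the ensemble equation).
s = np.array([0.9, 0.4, 0.7])     # outlier scores
s_c = np.array([0.8, 0.9, 0.5])   # confidence scores
w = np.array([0.8, 0.2, 0.8])     # weights

# E(O, W) = sum_i (s_i * s_ci * w_i)^3
# products: 0.576, 0.072, 0.280 -> cubes: ~0.1911, ~0.0004, ~0.0220
E = np.sum((s * s_c * w) ** 3)
print(round(E, 4))  # -> 0.2134; the first submodel contributes ~90%
```

The cubing makes the combination sharply non-linear, which is presumably the point, but it also makes the result very sensitive to the chosen weights.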
Throughout our experiment, we set the density score weight ($\omega_{\text{density}}$) to 0.8 and the distance score weight ($\omega_{\text{distance}}$) to 0.2.
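For concreteness, a fixed 0.8/0.2 mix of two min-max-normalized scores might look like the sketch below. The density and distance scores here are simple k-NN stand-ins (the paper's exact definitions are precisely what is in question), so this only illustrates the weighting mechanism:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))

def min_max(a):
    return (a - a.min()) / (a.max() - a.min())

# Pairwise distances; column 0 of the sorted matrix is each point's
# distance to itself, so neighbours start at column 1.
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
d_sorted = np.sort(d, axis=1)

# Toy stand-ins for the paper's two score families (not its actual method):
distance_score = d_sorted[:, 5]                # k-NN distance, k=5
density_score = d_sorted[:, 1:6].mean(axis=1)  # sparser region => higher

# The paper's fixed, human-chosen mix: w_density = 0.8, w_distance = 0.2.
w_density, w_distance = 0.8, 0.2
combined = w_density * min_max(density_score) + w_distance * min_max(distance_score)
```

Whatever the underlying scores are, the final ranking clearly depends on these two hand-picked constants, which is the crux of my concern below.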
It is not clear how the authors arrive at these coefficients, even empirically: are they simply mixing the scores however the user prefers, or tuning them for better accuracy, which would require labels!
The impact of this approach is shown in Fig. 4 of the paper, comparing density-based and distance-based scoring using the empirical coefficients of equation (15) as weights, i.e. the impact of human assistance!
Q2. Generally speaking, doesn't modifying the scoring with a human factor amount to manipulating the scoring to get the best results, when the art is to find non-parametric detection that minimizes human parameterization?
Maybe I'm not fully informed, and as an enthusiast I want to understand the reliability and validity of this promising detector. However, if we are moving towards unsupervised OD/AD, shouldn't we move towards non-parametric algorithms? Isn't setting the proposed parameters $\omega_{\text{distance}}$ and $\omega_{\text{density}}$ by hand against the spirit of unsupervised algorithms, even when the idea is sold as a "human-assisted" model?
Q3. What logic (XAI or explainability) could explain why the detection scoring should satisfy $\omega_{\text{density}} > \omega_{\text{distance}}$, i.e. why we intentionally force $\omega_{\text{density}}$ to dominate $\omega_{\text{distance}}$? Can this trick be justified by empirical experiments on contextualized features, specifically over imbalanced data, where there is no guarantee the features follow a normal (Gaussian) distribution?
This approach sounds like ad-hockery in outlier detection, as normally practiced by non-statisticians... ref
Background:
I'm familiar with the well-known unsupervised anomaly/outlier detection methods, based on this post:
How to test unsupervised learning methods for anomaly detection?