Embrace the Changes in the ML World

More than a year has passed since the first release of the ETA (encrypted traffic analysis) POC in December 2021. In February last year, we planned to take a step forward by generating encrypted traffic in real scenarios and training models on it. However, things turned out to be far more complicated than we had supposed. A tough problem we face is that a model trained on the CTU-13 public dataset performs poorly when making predictions on the firewalls in users' networks. This is understandable, since the CTU-13 environment differs greatly from users' environments. The question is how to deal with it.

I spent a lot of time last year studying it. The following is a basic summary, for the lecture next week:

Living in a changing world, it's no surprise that drift (change) will happen sooner or later. However, that doesn't mean dealing with it is easy. The first question we must answer is: what is drift? Then, when drift appears, how do we find it? Simply put, there are 5 facets to the data drift problem:

1. Covariate vs concept vs label: what changes? The features (covariate drift), the relation between features and labels (concept drift), or the labels themselves (label drift)?

2. Categorical vs continuous: different types of features and labels call for different analysis methods;

3. Multivariate vs univariate: is the drift examined feature by feature, or across all features as a whole?

4. Online vs offline: is the drift detected on each sample as it arrives, or must we collect all samples before detection?

5. Supervised vs unsupervised: is the drift detected based on features, or labels, or both?
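
To make the facets concrete, here is a minimal sketch of one corner of that space: covariate drift on a single continuous feature, checked offline and without labels. It uses a two-sample Kolmogorov-Smirnov test, which is just one common choice and not necessarily what we'll end up using for ETA. The function names, the "flow duration" feature, and the synthetic data are all made up for illustration.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two samples."""
    ref = sorted(reference)
    cur = sorted(current)
    gap = 0.0
    # Both CDFs are step functions, so the largest gap occurs at a data point.
    for x in ref + cur:
        f_ref = bisect_right(ref, x) / len(ref)
        f_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(f_ref - f_cur))
    return gap


def drifted(reference, current, alpha=0.05):
    """Flag drift when the KS statistic exceeds the large-sample critical value."""
    n, m = len(reference), len(current)
    # Asymptotic critical value: c(alpha) * sqrt((n + m) / (n * m)),
    # with c(alpha) = sqrt(-ln(alpha / 2) / 2).
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical "flow duration" feature: training data vs. a shifted production batch.
    train = [rng.gauss(0.0, 1.0) for _ in range(500)]
    production = [rng.gauss(1.0, 1.0) for _ in range(500)]
    print("KS statistic:", ks_statistic(train, production))
    print("drift detected:", drifted(train, production))
```

Running this per feature gives the univariate view. A categorical feature would need a different test (a chi-square test, for instance), and an online detector would have to update its decision sample by sample instead of comparing two complete batches.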

That's all for now. Data drift is a fascinating topic, and I'll write more about it.
