Darkmatter in Cyberspace

Showing posts from February, 2023

Research Plan for ETA

February 24, 2023

Today is a warm, sunny weekend. The smell of spring fills the air. After a year of preparing the supporting techniques, I think it's time to sketch a new picture of the ETA research that started in December 2021. Throughout 2022, I spent a lot of time trying to answer the question: why does a model that works well on the training set degrade significantly in a new environment? Can we predict the degradation before the model makes its predictions? That's where incremental learning and data/concept drift came in. In the first several months I focused on incremental learning algorithms, mainly on the Hoeffding tree implementation. However, the deployment style of incremental learning models is quite different from that of traditional batch learning models, which makes their adoption in production more difficult. The good news is that this research led me to concept drift detection algorithms, which then generalized to other forms of drift: covariate and label drift, and finally domain ...

Deployment of Console Applications

February 22, 2023

The first version of Python I used was 2.5.1, back when I had no idea what "open source" was. After almost 20 years, Python is still young and looks more promising day by day. During my Ph.D. research at BIT, Python was a meshing tool for Abaqus, the CAE software used in our lab. Its power lay in automating the manual meshing workflow. After I got a job as a software developer, Python was used mostly as a powerful DevOps tool. When I began developing scripts for data science, it was the best choice thanks to NumPy and pandas. Nobody denies Python's power as a scripting language. But a new area where it has begun to shine is application development and deployment. With the help of PEP 621, you can write Python code concisely as a standalone application, put it under version control, publish it to code sharing platforms like GitHub or Bitbucket, and deploy it with `pipx`. There's no need to write setup.py anymore. Instead, you can use `poetry` or `pdm` to build a ...

The Joy of Reading in Natural Light

February 18, 2023

Nowadays I read paperbacks much less than ebooks. It's not that I dislike paper books, but after so many years of reading on screens, my eyes have become very sensitive to the brightness of what I read. It's very hard for me to make out the characters when the page is not bright enough. Unfortunately, the light in my living room is too dim for me. This morning it was cloudy, but in the afternoon it cleared up, and the blue sky and white clouds appeared. So I moved the desk near the window. To my surprise, the characters in paper books are much clearer and more comfortable to read than under artificial light. Maybe such a simple fact is not worth mentioning. But to me, a programmer who spends too much time with the keyboard and screen, it brings back childhood memories of reading a book or doing homework in the yard. The characters blurred at sunset. That's when we were going to have supper. The window of my living room is east-facing, so to...

Met with ChatGPT

February 15, 2023

A couple of days ago I registered an account with the help of Sharon, because OpenAI blocks users from mainland China, Hong Kong, Russia, and Iran. This would have been unthinkable 10 or more years ago, when people believed globalization benefited all mankind, no matter who and where you were, or what you believed. However, it's also understandable, since some technologies developed in Western countries have been abused by autocratic governments to deprive people of their basic human rights. Strictly speaking, this isn't my first time interacting with AI. In recent years Google's search box has become more and more intelligent at auto-completion. Years ago auto-completion only happened at the end of the word list typed by the user. Now it can match fuzzily even in the middle of the word list. But literally talking with an AI is still a very novel experience. The biggest difference from what I had imagined about talking with an AI is its "transparency". It has clear statements about who it is, ...

Embrace the Changes in ML World

February 11, 2023

More than a year has passed since the first release of the ETA (encrypted traffic analysis) POC in December 2021. In February last year, we planned to take a step forward by generating encrypted traffic in real scenarios and training models on it. However, things turned out to be far more complicated than we had supposed. A tough problem we're facing is that the model trained on the CTU-13 public dataset performs poorly when making predictions on the firewalls in users' networks. This is understandable, since the CTU-13 environment differs greatly from the users' environments. The problem is how to deal with it. Last year I spent a lot of time studying it. The following is a basic summary for next week's lecture: Living in a changing world, it's not strange that a drift (change) will happen sooner or later. However, that doesn't mean dealing with it is easy. The first question we must answer is: what is a drift? Then, when a drift appears, how do we find ...
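
To make the "how do we find a drift" question concrete, here is a minimal sketch of one generic way to check a single input feature for covariate drift: a two-sample Kolmogorov-Smirnov test between a reference sample (for example, training data) and a current sample (for example, production data). This is an illustration only, not necessarily the method used in the ETA project; the function names, the sample values, and the `alpha` threshold are all assumptions.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    ref = sorted(reference)
    cur = sorted(current)
    gap = 0.0
    # The maximum gap between two step functions is reached at a data point.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def drift_detected(reference, current, alpha=0.05):
    """Flag drift when the KS statistic exceeds the large-sample critical value."""
    n, m = len(reference), len(current)
    # Asymptotic approximation: c(alpha) * sqrt((n + m) / (n * m))
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical


# Hypothetical feature values: the "current" sample is shifted by 50.
reference = list(range(100))
shifted = [x + 50 for x in reference]
print(drift_detected(reference, reference))  # same distribution
print(drift_detected(reference, shifted))    # shifted distribution
```

A per-feature test like this only looks at the inputs. It says nothing about label drift or concept drift, which need the labels or the model's errors to detect.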

Interpretable AI Series (3) Model-agnostic methods: Global interpretability

February 08, 2023

The 3rd chapter mainly covers the first part of the model-agnostic methods: global interpretability, which interprets the influence of features on the dependent variable (label) at the level of the whole model. The model used in this chapter is the random forest, a decision tree ensemble, but the method can easily be extended to other black-box models. Here "model-agnostic" means these methods can be used to interpret any model without modifying the details of the method. The first section introduces the dataset: high school student performance and several other features, such as race, parent level of education, etc., followed by an introduction to exploratory data analysis (EDA for short). Here the author mainly uses histograms to show the grade distribution across different categorical features and their combinations. The following section recaps tree ensembles, including 2 possible approaches to building an ensemble: bagging (random forest) and boosting (adaptive boosting and ...
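
As an illustration of what "model-agnostic" and "global" mean in practice, the sketch below computes a partial dependence curve, one common technique of this kind. The excerpt above does not say which techniques the chapter uses, so treat this as an assumed example. The toy model, the feature names, and the data are hypothetical stand-ins; the only thing the method needs from the model is a `predict` function.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction over the dataset with one feature forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)              # copy so the data is not mutated
            modified[feature_index] = value   # overwrite the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


# Hypothetical black box standing in for a trained random forest.
def toy_model(row):
    feature_a, feature_b = row
    return 50.0 + 5.0 * feature_a - 2.0 * feature_b


# Hypothetical dataset with two numeric features per row.
rows = [[1.0, 0.0], [2.0, 3.0], [4.0, 1.0]]
curve = partial_dependence(toy_model, rows, feature_index=0, grid=[0.0, 2.0, 4.0])
print(curve)
```

Because `partial_dependence` never looks inside `toy_model`, the same function works unchanged for any other model that exposes a prediction function.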

The Art of Wearing Mask

February 04, 2023

Covid-19 in Beijing is diminishing, at least for now. Everyday life is returning to what it was before the pandemic. But something has definitely changed. For example, 95% of people here wear masks in shopping malls and other indoor places, which gives some interesting clues to people's thoughts. Many people wear "neck" masks: that is, although the ear loops are still hanging on the ears, the mask itself is pulled down to the neck rather than covering the face. Do their necks need to be protected? Obviously not. Then why do they wear a mask in this way? Maybe it is a good demonstration of the latent function of face masks in an autocratic and mutually supervised society: the identity of an obedient citizen. We know that, as protective equipment, a mask is used by respiratory doctors and patients as a mechanical barrier that interferes with airflow in and out of the mouth and nose. But why can masks be used as a social identity? Well, this is a story from 3 years ago. In...

Interpretable AI Series (2) White-box models

February 01, 2023

This post mainly covers chapter 2 of "Interpretable AI". White-box models are simple and straightforward machine learning models. To be concrete, the linear regression model assumes that the label (also known as the result variable, or dependent variable) is a linear combination of the input features. So the absolute value of each feature's coefficient represents its share of the influence, and the sign shows whether that influence is positive or negative. For example, a company plans an advertising budget for both TV and newspapers. What's the best ratio between these 2 channels? The answer is: it depends on the prediction of the regression model. Let's say the model shows that annual income is distributed as one-third from newspapers and two-thirds from TV; then we should put two-thirds of the budget into TV and one-third into newspapers. The main limitation of the linear model is that most relationships between input and output aren't linear. So it can only give us a comparatively rough prediction. Th...
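
The advertising example above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the spend and income figures are made up, the model has no intercept to keep the arithmetic short, and the helper name `fit_two_feature_linear` is hypothetical. Both features are in the same unit (money spent), which is what makes comparing coefficient magnitudes meaningful here.

```python
def fit_two_feature_linear(xs, ys):
    """Least-squares fit of y = w0*x0 + w1*x1 (no intercept) via the 2x2 normal equations."""
    s00 = sum(x0 * x0 for x0, _ in xs)
    s01 = sum(x0 * x1 for x0, x1 in xs)
    s11 = sum(x1 * x1 for _, x1 in xs)
    t0 = sum(x0 * y for (x0, _), y in zip(xs, ys))
    t1 = sum(x1 * y for (_, x1), y in zip(xs, ys))
    det = s00 * s11 - s01 * s01
    if det == 0:
        raise ValueError("features are collinear; the fit is not unique")
    # Cramer's rule on the normal equations.
    w0 = (t0 * s11 - t1 * s01) / det
    w1 = (t1 * s00 - t0 * s01) / det
    return w0, w1


# Made-up data: (tv_spend, newspaper_spend) and the resulting income.
spend = [(1, 0), (0, 1), (1, 1), (2, 1)]
income = [4, 2, 6, 10]

w_tv, w_news = fit_two_feature_linear(spend, income)
total = abs(w_tv) + abs(w_news)
tv_share = abs(w_tv) / total
news_share = abs(w_news) / total
print(w_tv, w_news)          # fitted coefficients
print(tv_share, news_share)  # share of influence per channel
```

With this made-up data the fit recovers coefficients of 4 for TV and 2 for newspapers, which gives the two-thirds and one-third split described in the post.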