Posts

Debug Xray in NixOS

From SSH Lockout to a Perfect NixOS Image: A Debugging Odyssey Deploying a custom service on a fresh server should be straightforward. But sometimes, a simple task spirals into a multi-layered journey that tests your assumptions and tools at every level. This is the story of one such journey: a tale of deploying a secure Xray proxy on NixOS that started with a simple SSH failure and ended with a deep dive into kernel syscalls. Here are the key insights from that odyssey. The Initial Failure: Mismatched Blueprints The first attempt involved using nixos-anywhere to overwrite a fresh DigitalOcean droplet. The command finished, but the server was unreachable: no SSH, no console. The initial suspect, a cloud firewall, was a red herring. The real culprit was a fundamental mismatch in our NixOS configuration. The configuration.nix specified a legacy BIOS boot (efiSupport = false;), but the disko.nix partitioning script was creating a modern EFI-style disk layout with a separate /boot ...

My Sleeping Solutions

Timing: 1. Find your sleepy window: mine is between 22:00 and 23:00. If I miss it, I am very likely to stay up late, because I am no longer sleepy; 2. No eating after 20:30 (90 minutes before the window); 3. Start preparing for sleep at 21:00 (an hour before the window): brushing teeth, bathing, making the bed, going to the toilet, etc.; 4. Be gentle: I don't blame myself if I miss the timing. I tell myself: do it right tomorrow; 5. I am a big fan of audiobooks. I listen to my favorite ones while lying in bed. When sleeping: 1. Keep the room absolutely dark. Turn off power indicator lights, charging lights, nightlights, etc., so there is (almost) no difference between having my eyes open and closed; 2. Move anything that makes noise out of the bedroom, so I can enjoy the sounds of nature; 3. Keep the bedroom cool (people get sleepy more easily in a cool environment).

Log Aggregation Study

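I. Aggregation algorithms based on log template parsing (structured methods) The core idea of this class of algorithms is **"strip the variables, keep the structure"**. By identifying the constant part (the template) and the variable part (the parameters) of each log line, logs that share a template are grouped together. This is the most direct and most cost-effective way to handle changing timestamps and random IDs. 1. Drain (recommendation: ⭐⭐⭐⭐⭐) Principle: an online parsing algorithm based on a fixed-depth tree. It assumes that the beginning of a log line is usually fixed, and uses a predefined depth and similarity threshold to quickly group the log stream and extract templates. Applicability: Strengths: it fits the Flink streaming-engine scenario described in the document, is extremely fast, and adds almost no latency. It accurately strips out variables such as timestamps and SessionIDs. Pain point addressed: it solves the problem of "the same attack being split apart because of minor payload differences". Deployment advice: use it as the first line of defense for aggregation. For example, parse SQL Injection id=123 and SQL Injection id=456 into the template SQL Injection id=<*> , and aggregate on the template hash. 2. Spell (streaming longest common subsequence) Principle: extracts log templates online based on the LCS (longest common subsequence). Applicability: suited to logs whose length varies widely while the positions of the core keywords stay relatively fixed. Slightly slower than Drain, but slightly more fault-tolerant. 3. LogPai / LogParse Note: a collection of industrial-grade log parsing tools that integrates multiple algorithms (such as IPLoM and LenMa); suited to offline analysis or batch-processing engines, for establishing baselines. II. Aggregation algorithms based on locality-sensitive hashing (high-performance deduplication) When logs cannot be fully structured (for example encrypted traffic or non-standard text), text similarity has to be compared instead. A traditional hash requires identical input, whereas LSH lets logs that are similar but not identical produce the same (or a nearby) hash value. 1. SimHash (recommendation: ⭐⭐⭐⭐) Principle: maps a high-dimensional feature vector to a low-dimensional fingerprint and judges similarity by Hamming distance. Applicability: Scenario: well suited to the "minor payload differences" case mentioned in the document. Deployment: extract key features of the log (such as IP, URL path, specific keywords) and generate a SimHash fingerprint. If the Hamming distance between two fingerprints is below a threshold (for example 3), aggregate them. Strengths: low computational complexity, very well suited to real-time deduplication of massive data. 2. MinHash Principle: uses Jaccard similarity, estimating the similarity of sets quickly via random-permutation hashing. Applicability: compared with S...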
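The two ideas in this excerpt, grouping by template and comparing SimHash fingerprints, can be sketched roughly in Python. This is only an illustration under stated assumptions, not the post's implementation: `to_template` is a hypothetical stand-in that masks digit runs with a regular expression (real Drain learns templates with a fixed-depth tree), and the SimHash here uses unweighted tokens hashed with MD5, which is just one possible choice.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical masking rule: treat every run of digits as a variable.
# Real Drain discovers templates with a fixed-depth parse tree instead.
_DIGITS = re.compile(r"\d+")


def to_template(line: str) -> str:
    """Replace each run of digits with the <*> wildcard."""
    return _DIGITS.sub("<*>", line)


def group_by_template(lines):
    """Group raw log lines under their masked template string."""
    groups = defaultdict(list)
    for line in lines:
        groups[to_template(line)].append(line)
    return dict(groups)


def simhash(tokens, bits: int = 64) -> int:
    """Build an unweighted SimHash fingerprint (at most 64 bits) from tokens."""
    weights = [0] * bits
    for token in tokens:
        # Take the first 8 bytes of MD5 as a 64-bit token hash (an assumption).
        digest = hashlib.md5(token.encode("utf-8")).digest()[:8]
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


groups = group_by_template(["SQL Injection id=123", "SQL Injection id=456"])
# Both lines fall under the single template "SQL Injection id=<*>".
```

Two log lines would then be aggregated when `hamming(simhash(a_tokens), simhash(b_tokens))` is below a chosen threshold, such as the value 3 mentioned above. Which features to tokenize (IP, URL path, keywords) is left open here.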

No Escape

In the Economist article The world’s waste problem is growing fast, Mr. Oliver did an excellent job as a journalist: digging up and gathering the facts and putting them on the table, so that everyone has to face them, think about them, and try to fix them in different ways. The article is an introduction to Mr. Oliver’s book, Wasteland. In the book he writes down what he saw in some special places, which he calls wastelands, to expose some shady corners of the global and local economic networks, such as plastic pollution, toxic colonialism, planned obsolescence, etc. Plastic pollution is the problem closest to our daily lives. Years ago in China, plastic bags were free in supermarkets and grocery stores. Then the government (at least in Beijing) enacted a law making paid plastic bags mandatory. Since then there has been a charge for plastic bags in supermarkets, while they remain free in most small grocery stores. People are more and more aware of plastic pollution. However, at least by my observation, the ...

Mouth vs Brain

A couple of days ago I read two articles about LLMs (large language models) and ChatGPT. The first one is How ChatGPT and Other LLMs Work—and Where They Could Go Next on Wired. It mainly discusses the following topics: The main idea behind an LLM: given some user-provided words (usually a question), the LLM chooses the most suitable continuation without understanding the meaning of the question or of the sentence it emits. Technical lineage: from neural networks to the transformer, then to LLMs. How human participation improves model quality. Here the article introduces the basic idea of RLHF (reinforcement learning from human feedback), the silver bullet of ChatGPT. The second article, How ChatGPT actually works, contains far more technical details. It explains why ChatGPT-4 is superior to its precursors (better alignment), the detailed steps of RLHF, how to evaluate the performance of an LLM, and finally the shortcomings of the methodology used in current model training. After reading the above articles, ...

The Value of Group Sports

This afternoon we watched the movie "Slam Dunk". Besides the exciting offense and defense scenes on the court and the delicate portrayal of the relationships between family members, I have some new thoughts on sports and games. People in East Asia traditionally believe that intellectual training is the essence of education. Compared with it, sports are more or less optional and subordinate in the educational system. P.E. classes are frequently taken over by language, math, or science classes. In junior and senior high school, I was a fan of ping-pong. I spent most of my spare time, and even time in class, reading books and journals about it and practicing basic and complex skills. Thirty years later, I have to say the young me missed some essential parts of games. Besides strategies, tactics, and skills, games are more about communication and cooperation. The basketball court is always full of physical contact, fierce confrontations, and even altercations and injuries. However, you must...

Life on Habitica

It has been two weeks since I installed Habitica on my phone. It's really a wonderful tool which, like AntennaPod and @Voice, has changed my lifestyle dramatically. I have put a couple of items on it. Please forgive me for listing them out. Tasks in the "Habits" channel have a plus button. Every time you finish a task, click the button to get the corresponding rewards (experience and coins, with which you can buy food or equipment). The tasks here can also have an optional "minus" button. Every time you fail the task, click the button to accept a punishment (a negative reward). So far I've added only one habit: going to bed before 23:00. Tasks in the "Daily" channel need to be done on a daily/weekly/monthly basis. You can choose how many times the task must be done per time unit. Here are my daily tasks: Listen to an English/Japanese/Cantonese episode (podcast with AntennaPod, audiobook/magazine with Musicolet); Learning Japanese/Cantonese on Duolingo; Learning n...

Life as an RPG Game

We all have many ideas about our lives: learning new skills, keeping healthy, reading an interesting book, going to sleep earlier... However, as people say, buying a book is always easier than reading it. We make so many wonderful plans and then forget most of them. Year after year, we stay at the starting point, imagining a better version of ourselves in a parallel universe. Why can't we keep our promises and make real changes? An important factor is that, as time goes by, keeping a promise becomes more and more boring, and it gets replaced by new (and never practiced) plans. Why does our passion diminish day by day? Because we can't get feedback in everyday life, since fundamental changes need a long time to take effect. Here games offer us a perspective for solving the issue. Why does our passion for games never fade? Because in games we get feedback very soon. If we could get feedback just as quickly on reading books or keeping up exercise, those habits might not be abandoned as time goes by. Hab...

A Good Tech Writing Tool

Tech writing differs from general writing in several ways. For example, technical documents usually contain code blocks. Their rendering is key to a better comprehension of the article. In many cases, the code blocks are "live" rather than static. They are executed and the outputs (calculation results, including text and images) are inserted into the text as part of the article. Another key component of tech writing is math formulas, whose rendering is also very important. Nowadays MathJax is more prevalent than LaTeX. Other requirements include auto-numbered sections, cross-references, writing in plain text (hence easy integration with a version control system), report generation in popular formats (such as HTML and PDF), etc. The tools or frameworks I've used for tech writing include Asciidoc, Rmarkdown, and Jupytext, each with its own pros and cons. Today I found quarto (based on pandoc). Its feature set fits perfec...

A Brave Chinese

Today I listened to a recording of an old Chinese man's comments about China's relationship with Russia and the U.S. That's exactly what I want to say. As he said, the U.S. played an important role in the Chinese modernization process. However, the CCP teaches our kids to hate it, just as it did 70 years ago. What did we get after that? Now they are playing the same trick again. When I talk with my parents, I always feel pity and anger for them. They're very nice people. They don't hurt anyone. They obey all rules and orders, with or without consent. They worked hard all their lives. They saved every cent. But what did they get? A poor education and a distorted worldview based on systematic lies. Thanks to the CCP's state-of-the-art brainwashing, they hate the U.S., which has always tried to help us. They embrace Russia, even though it has always bullied us. If you don't agree with me about who is a friend and who is an enemy, please listen to the old man's comments in full. Then tell me whether he told lies.

ChatGPT as a Personal Translator

After reading a good article about the mechanism of ChatGPT, it occurred to me to test ChatGPT's translation ability on it. So I entered the following text at the ChatGPT prompt: > Please translate the article at the following URL to Chinese: https://www.assemblyai.com/blog/how-chatgpt-actually-works/ > 请将以下网址的文章翻译成中文:https://www.assemblyai.com/blog/how-chatgpt-actually-works/ > 如何ChatGPT实际工作的解释 > 在自然语言处理和人工智能领域,OpenAI的GPT系列已经成为一个备受瞩目的成就。它被用于许多领域,包括自动问答、机器翻译和聊天机器人等,其中最著名的是ChatGPT。在这篇文章中,我们将探讨ChatGPT是如何工作的,以及它在聊天机器人应用中的优缺点。 > 什么是ChatGPT? > ChatGPT是一种预训练模型,用于生成人类般的对话。它由OpenAI团队开发,利用了其最新的语言处理技术,包括transformer、自监督学习和预训练模型。ChatGPT是GPT-3的一个变体,但是在实现上更加专注于聊天机器人的应用。 > 如何训练ChatGPT? > ChatGPT的训练方式和其他GPT模型一样,采用了自监督学习的方法。这种方法需要大量的文本数据来训练模型,以便它可以学习语言的结构、语法和语义等方面。ChatGPT使用的文本数据是从互联网上收集的,并且经过了清理和预处理,以便模型可以更好地学习这些数据。 > ChatGPT的工作原理是什么? > ChatGPT的工作原理非常简单。当用户输入一个问题或者对话时,ChatGPT会使用自然语言处理技术来理解用户的意图,并生成一个响应。这个响应是通过模型中已经训练好的文本生成技术生成的,它会尽可能地模仿人类的对话方式...

Pending Reading List

This list contains excellent books I want to finish but find challenging to read alone. So my dear friend, if you are interested in any of them, please let me know. And let's experience a fantastic journey of reading, thinking, discussing, and finally transcending together. Which books will be added to this list? Well, first, they are masterpieces in their fields. Most of them have scores above 4 on Goodreads or 8.5 on Douban. Second, the reading experience is wonderful. The beautiful text and fabulous ideas of the author make me stop caring how much time the reading takes. In fact, each deserves to be read more than once. That said, how can we read a book together when reading is a personal activity? IMO, we should distinguish casual reading (not much different from watching movies or playing games) from intellectual reading, which is the essence of intellectual life. Intellectual life is seldom discussed nowadays. But it's the best lifestyle I can imagine so far. It d...

On Podcast Clients

As a podcast fan, I subscribe to a bunch of wonderful channels. For news, there's BBC News, Up First from NPR, World Story of the Day, Today in Focus, The Daily from the New York Times, Chinese News from NHK, On the Media from WNYC, etc. For current affairs commentary, there are The Foreign Affairs Interview, The Political Scene from the New Yorker, bumingbai.net, cdt.media, Sinica, China Insider, The NPR Politics Podcast, BBC Documentary, etc. Other categories include movie and music commentary, science and technology, storytelling, games, language learning, etc. On Android, AntennaPod and Google Podcasts are the ones I use most. On Linux, gPodder is a good choice. It is open-source software that downloads all episodes as MP3 files, which can easily be synced to a local folder. It works well in WSL, thanks to the excellent job done by the WSL team at Microsoft. MP3 export is particularly important in mainland China, because many of the channels are blocked and can only be shared as files rather than links. The com...

A New Path of Higher Education

I'm reading the book "Building the International University: Minerva and the Future of Higher Education" by Stephen M. Kosslyn these days. It's really a mind-blowing book about higher education. We all want to live in a safe, civilized society and lead a decent, wonderful life. But how? Some say we need to be rich. Others say we need freedom and democracy. Now that it is 2023, we know they're not enough. Middle-class Chinese in big cities are rich, but they do not feel safe. Russia became a democratic country after the USSR collapsed. However, the Russians voted in a dictator who has ruled for more than 20 years and has pushed them into an absurd, brutal war. Then how about giving the Chinese democracy? That sounds good. But if you've lived in China long enough, you'll find that the Chinese today would probably elect another dictator democratically, as the Russians did 20 years ago. So what are the missing pieces in their jigsaw? One of them is political v...

Endless Wars in Cyberspace

I'm planning to do some research on threat intelligence over the next several months. For me, it's a brand new area, and hopefully I will find something interesting this year. Nowadays cyberattacks are becoming more and more complex and subtle. An essential problem faced by defenders is that attackers know what the defenders know, while the defenders don't know what the attackers are plotting. Cyberspace differs from the physical world in that there's no clear boundary between attackers and defenders. In the physical world, we can build forts with walls and trenches. We can see where attackers are and what weapons they have. When we are defeated, at least we know who attacked us and what their target was (nothing more than land, treasure, and slaves). In cyberspace, by contrast, a perfect attack aimed at information exfiltration leaves no clue that it ever happened. The key difference lies in the fact that physical property is exclusive. It can only belong to one person at a specific ...