Debug Xray in NixOS
From SSH Lockout to a Perfect NixOS Image: A Debugging Odyssey
Deploying a custom service on a fresh server should be straightforward. But sometimes, a simple task spirals into a multi-layered journey that tests your assumptions and tools at every level. This is the story of one such journey: a tale of deploying a secure Xray proxy on NixOS that started with a simple SSH failure and ended with a deep dive into kernel syscalls. Here are the key insights from that odyssey.
The Initial Failure: Mismatched Blueprints
The first attempt involved using nixos-anywhere to overwrite a fresh DigitalOcean droplet. The command finished, but the server was unreachable—no SSH, no console. The initial suspect, a cloud firewall, was a red herring.
The real culprit was a fundamental mismatch in our NixOS configuration. The configuration.nix specified a legacy BIOS boot (efiSupport = false;), but the disko.nix partitioning script was creating a modern EFI-style disk layout with a separate /boot partition. The system literally didn’t know how to start itself.
Insight: Your disk partitioning scheme and your bootloader configuration must agree. Ensure your disko.nix or filesystem config aligns with your boot.loader.grub.efiSupport setting.
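As an illustration of a layout and bootloader that agree, here is a minimal sketch of a legacy BIOS pairing. It is not the configuration from this deployment: the disk name `main`, the device `/dev/vda`, and the ext4 root are assumptions, and the attribute names follow disko's documented GPT examples.

```nix
{
  # Hypothetical disk layout for a BIOS-booted VM; /dev/vda is an assumption.
  disko.devices.disk.main = {
    type = "disk";
    device = "/dev/vda";
    content = {
      type = "gpt";
      partitions = {
        # 1 MiB BIOS boot partition (GPT type EF02) where GRUB embeds its core image.
        bios = {
          size = "1M";
          type = "EF02";
        };
        # Single root filesystem; no separate EFI system partition in this variant.
        root = {
          size = "100%";
          content = {
            type = "filesystem";
            format = "ext4";
            mountpoint = "/";
          };
        };
      };
    };
  };

  # Bootloader side of the same decision: legacy BIOS, no EFI.
  boot.loader.grub = {
    enable = true;
    efiSupport = false;
  };
}
```

To my understanding, disko registers disks that carry an EF02 partition as GRUB install targets, so no explicit GRUB device is set here; if your setup differs, you may need to set one yourself. An EFI variant would instead pair an EFI system partition mounted at /boot with efiSupport = true.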
The Second Failure: A Race Against Time
After fixing the boot configuration, the installation proceeded but the server was still unreachable. This time, the deployment logs revealed a clue: chown: invalid user: ‘cloudflare-dyndns’. We accessed the VM’s logs through the DigitalOcean Recovery Console after a root password reset.
# Get to an interactive shell in the Recovery Console
# Then mount the disk and view logs from the failed boot
sudo mount /dev/vda2 /mnt
sudo journalctl --directory=/mnt/var/log/journal --boot=-1
An activationScript was set up to create a Cloudflare token file and set its ownership. However, this script was running before the services.cloudflare-dyndns module had a chance to create the cloudflare-dyndns user. This race condition halted the boot process.
Insight: For creating files with specific ownership at boot, avoid activationScripts. Use the robust, idiomatic systemd.tmpfiles.rules instead, which is designed for exactly this purpose and runs at the correct time.
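A minimal sketch of what such rules might look like follows. The directory and file paths are hypothetical, and the rules only create the file and set its ownership; how the token contents get there is left out.

```nix
{
  systemd.tmpfiles.rules = [
    # Fields: type  path  mode  user  group  age
    # Hypothetical directory for the service's state.
    "d /var/lib/cloudflare-dyndns 0750 cloudflare-dyndns cloudflare-dyndns -"
    # Create the token file (empty if missing), readable only by the service user.
    "f /var/lib/cloudflare-dyndns/token 0600 cloudflare-dyndns cloudflare-dyndns -"
  ];
}
```

Because these rules are applied by systemd-tmpfiles rather than by an activation script, the ownership is set once the cloudflare-dyndns user is available, which avoids the ordering problem described above.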
The Pivot: When in Doubt, Build an Image
After two failures with the “overwrite” method, it was time for a more reliable strategy: building a custom DigitalOcean image. Instead of trying to change a running system, we would build a perfect, bootable image from scratch. This introduces nixos-generators, a powerful tool for creating cloud images from a Nix flake.
Insight: Direct installation tools like nixos-anywhere can be brittle. Building a custom, immutable image with a tool like nixos-generators is often a more robust and reliable deployment method.
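For orientation, a flake exposing such an image could look roughly like the sketch below. The output name digitalocean-image matches the build command used later in this post; the nixpkgs branch and the single ./configuration.nix module are assumptions, not the exact flake used here.

```nix
{
  inputs = {
    # Assumption: tracking nixos-unstable; pin whichever channel you actually use.
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
    nixos-generators = {
      url = "github:nix-community/nixos-generators";
      inputs.nixpkgs.follows = "nixpkgs";
    };
  };

  outputs = { self, nixpkgs, nixos-generators, ... }: {
    # Built with: nix build .#digitalocean-image
    packages.x86_64-linux.digitalocean-image = nixos-generators.nixosGenerate {
      system = "x86_64-linux";
      # "do" is the nixos-generators format name for DigitalOcean images.
      format = "do";
      modules = [ ./configuration.nix ];
    };
  };
}
```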
The Iterative Debugging Loop for Custom Images
Debugging a bootable image requires a fast iteration cycle. After each configuration change, the following loop was executed:
- Build the image:
    nix build .#digitalocean-image
- Prepare the image for QEMU (inside nix-shell -p qemu):
    # Copy to /tmp and decompress
    cp ./result/nixos-image-digital-ocean-*.qcow2.gz /tmp/my-nixos-image.qcow2.gz
    cd /tmp
    gunzip my-nixos-image.qcow2.gz
    # Add write permissions for QEMU
    chmod +w my-nixos-image.qcow2
- Boot in QEMU (headless, with port forwarding):
    qemu-system-x86_64 \
      -m 2048 \
      -enable-kvm \
      -drive file=/tmp/my-nixos-image.qcow2,format=qcow2 \
      -netdev user,id=net0,hostfwd=tcp::2222-:22,hostfwd=tcp::4443-:443 \
      -device virtio-net-pci,netdev=net0 \
      -nographic
  (Note: host port 4443 is forwarded to the VM's port 443 for Xray testing, and host port 2222 to the VM's port 22 for SSH access.)
  The Xray server should now be running. You can test it from the host:
    eget xtls/xray-core -a linux-64   # install an Xray client
    xray -c test/client-config.json   # start the Xray client
    curl --socks5-hostname localhost:1080 https://ifconfig.me
  If something went wrong, continue with the next step.
- SSH into the VM (from a new host terminal):
    ssh -p 2222 -i ~/.ssh/id_ed25519 root@localhost
  From here you can investigate, for example with systemctl status xray.
  Then update configuration.nix and prepare for the next loop.
- Clean up:
  First stop the QEMU VM: run poweroff inside the SSH session, or press Ctrl-a then x in the QEMU terminal.
  Then delete the image file: rm /tmp/my-nixos-image.qcow2.
The Final Boss: Debugging the Silent Failure
The custom image was built and booted in a local QEMU VM for testing. The service was running, but it couldn’t connect to the internet. The curl test through the proxy simply hung. This began a deep dive into the service’s behavior.
- Is it running? We checked the service status.
    systemctl status xray
- Is it listening? We verified Xray was bound to port 443.
    sudo ss -lntp | grep 443
- Is the config correct? We inspected the generated Xray configuration.
    systemctl cat xray.service
    cat /run/credentials/xray.service/config.json  # Actual path from systemctl output
- Is it DNS? A quick doggo ifconfig.me inside the VM proved DNS was working perfectly.
    doggo ifconfig.me
- Does the VM have internet? A direct curl from the VM showed outbound access was fine.
    curl https://ifconfig.me
- The strace Turning Point (IPv6): When logs are silent, you must go deeper. We attached strace to the running Xray process to watch its direct communication with the Linux kernel.
    # Identify the server process PID
    pgrep -u xray -l xray
    # Attach strace to the server PID, tracing network calls
    sudo strace -p <SERVER_PID> -e trace=network
    # In a new terminal, trigger the hang with the client
    curl --socks5-hostname localhost:1080 https://ifconfig.me
  The output was the smoking gun:
    connect(...) = -1 ENETUNREACH (Network is unreachable)
  The sandbox wasn’t the issue. The Xray process was trying to connect out using IPv6 on a network that only supported IPv4. The kernel correctly told it the network was unreachable.
Insight: When an application fails silently, strace is your most powerful tool. It bypasses application logs and shows you the ground truth of what the process is asking the kernel to do.
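For the IPv6 failure specifically, one possible remedy on NixOS is to turn IPv6 off system-wide so that processes stop attempting IPv6 connections. The sketch below uses the standard networking.enableIPv6 option; whether this is appropriate depends on your network, and it is shown here as an assumption rather than as the exact change made in this deployment.

```nix
{
  # Assumption: the target network offers IPv4 only, so IPv6 is disabled globally.
  networking.enableIPv6 = false;
}
```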
- The Final strace (Internal Hang): Even after fixing IPv6, the curl still hung, and strace -e trace=network showed nothing. This indicated an internal hang before even attempting a network call. A full strace revealed the truth.
    # Identify the server process PID
    pgrep -u xray -l xray
    # Attach strace comprehensively (all threads, all syscalls), outputting to a file
    sudo strace -f -p <SERVER_PID> -e trace=all -o /tmp/xray_trace.log
    # In a new terminal, trigger the hang with the client
    curl --socks5-hostname localhost:1080 https://ifconfig.me
    # After a few seconds, stop strace and view the tail of the log
    tail -n 20 /tmp/xray_trace.log
  This trace revealed the process was silently blocked by SystemCallFilter and CapabilityBoundingSet, preventing it from making outbound connections.
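If overly strict hardening is the cause, the two settings named above can be relaxed through the unit's serviceConfig. The following is a sketch under assumptions, not the fix applied here: the exact values are illustrative, and lib.mkForce is used only so they override whatever the service module or earlier configuration already sets.

```nix
{ lib, ... }:
{
  systemd.services.xray.serviceConfig = {
    # Keep only the capability needed to bind to a privileged port such as 443.
    CapabilityBoundingSet = lib.mkForce [ "CAP_NET_BIND_SERVICE" ];
    AmbientCapabilities = lib.mkForce [ "CAP_NET_BIND_SERVICE" ];
    # @system-service is systemd's predefined allow-list for typical services
    # and includes the network I/O syscalls a proxy needs.
    SystemCallFilter = lib.mkForce [ "@system-service" ];
  };
}
```

After rebuilding, systemctl show xray -p SystemCallFilter -p CapabilityBoundingSet can confirm what the running unit actually received.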
The “One Last Thing” Moment: A REALITY Check
After disabling IPv6, curl still failed, but this time with a new, application-level error from the Xray client: REALITY: received real certificate. This error means the client and server’s cryptographic handshake failed. The cause? A simple but critical human error: the server’s private key had been copied into the client’s public key field. Generating the correct public key from the private key and placing it in the client config finally solved the issue.
Insight: Always double-check your cryptographic inputs. A
privateKeyandpublicKeyare a pair, not interchangeable copies.
Conclusion
This debugging journey was a powerful lesson in methodical problem-solving. It reinforced that when deploying complex systems, the problem is rarely where you first look. By starting with the high-level infrastructure and systematically drilling down—from bootloaders to systemd services, and finally to kernel syscalls and application logic—we were able to find and fix every issue, resulting in a perfectly configured, secure, and deployable NixOS image.
P.S.: When running Zellij in Git Bash on Windows with Zellij's mouse mode enabled, selecting and copying text can be tricky: hold Shift, press the left mouse button, and drag to select. When you release the mouse button, the selected text is copied to the clipboard automatically.