Debug Xray in NixOS

From SSH Lockout to a Perfect NixOS Image: A Debugging Odyssey

Deploying a custom service on a fresh server should be straightforward. But sometimes, a simple task spirals into a multi-layered journey that tests your assumptions and tools at every level. This is the story of one such journey: a tale of deploying a secure Xray proxy on NixOS that started with a simple SSH failure and ended with a deep dive into kernel syscalls. Here are the key insights from that odyssey.

The Initial Failure: Mismatched Blueprints

The first attempt involved using nixos-anywhere to overwrite a fresh DigitalOcean droplet. The command finished, but the server was unreachable—no SSH, no console. The initial suspect, a cloud firewall, was a red herring.

The real culprit was a fundamental mismatch in our NixOS configuration. The configuration.nix specified a legacy BIOS boot (efiSupport = false;), but the disko.nix partitioning script was creating a modern EFI-style disk layout with a separate /boot partition. The system literally didn’t know how to start itself.

Insight: Your disk partitioning scheme and your bootloader configuration must agree. Ensure your disko.nix or filesystem config aligns with your boot.loader.grub.efiSupport setting.
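
As a rough illustration of a layout that agrees with efiSupport = false, the sketch below pairs a GPT disk carrying a small BIOS boot partition with a legacy GRUB setup. The device path /dev/vda, the disk name main, the partition names, and the ext4 root are assumptions for illustration, not the configuration from this deployment.

```nix
{
  # Hypothetical disko layout for a BIOS-booted droplet (names and device are assumed).
  disko.devices.disk.main = {
    type = "disk";
    device = "/dev/vda";
    content = {
      type = "gpt";
      partitions = {
        # 1 MiB BIOS boot partition (GPT type EF02) where GRUB embeds its core image.
        bios = {
          size = "1M";
          type = "EF02";
        };
        # Single root filesystem; no separate EFI system partition is created.
        root = {
          size = "100%";
          content = {
            type = "filesystem";
            format = "ext4";
            mountpoint = "/";
          };
        };
      };
    };
  };

  # Bootloader side: legacy BIOS GRUB, matching the partition table above.
  boot.loader.grub = {
    enable = true;
    efiSupport = false;
  };
}
```

Depending on the disko version in use, the GRUB install device may also need to be set explicitly; check the evaluated boot.loader.grub.devices value before deploying.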

The Second Failure: A Race Against Time

After fixing the boot configuration, the installation proceeded but the server was still unreachable. This time, the deployment logs revealed a clue: chown: invalid user: ‘cloudflare-dyndns’. We accessed the VM’s logs through the DigitalOcean Recovery Console after a root password reset.

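# Get to an interactive shell in the Recovery Console
# Then mount the root partition and view logs from the failed boot
sudo mount /dev/vda2 /mnt
# --boot=-1 selects the boot before the last one recorded in the journal;
# run with --list-boots instead if unsure which entry is the failed boot
sudo journalctl --directory=/mnt/var/log/journal --boot=-1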

An activationScript was set up to create a Cloudflare token file and set its ownership. However, this script was running before the services.cloudflare-dyndns module had a chance to create the cloudflare-dyndns user. This race condition halted the boot process.

Insight: For creating files with specific ownership at boot, avoid activationScripts. Use the robust, idiomatic systemd.tmpfiles.rules instead, which is designed for exactly this purpose and runs at the correct time.
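
A minimal sketch of what such rules could look like follows. The directory and file paths are hypothetical, and the sketch assumes the cloudflare-dyndns user and group named in the error message exist as regular system accounts by the time tmpfiles runs.

```nix
{
  # Hypothetical paths; only the user/group name is taken from the error message.
  systemd.tmpfiles.rules = [
    # Fields: type, path, mode, user, group, age
    # "d" creates the directory if it is missing and sets its mode and ownership.
    "d /var/lib/cloudflare-dyndns 0750 cloudflare-dyndns cloudflare-dyndns -"
    # "f" creates an empty file if it is missing and sets its mode and ownership.
    "f /var/lib/cloudflare-dyndns/token 0600 cloudflare-dyndns cloudflare-dyndns -"
  ];
}
```

The rules only create the file and fix its ownership. The token itself should not be written into a rule, because rule text ends up in the world-readable Nix store.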

The Pivot: When in Doubt, Build an Image

After two failures with the “overwrite” method, it was time for a more reliable strategy: building a custom DigitalOcean image. Instead of trying to change a running system, we would build a perfect, bootable image from scratch. This introduces nixos-generators, a powerful tool for creating cloud images from a Nix flake.

Insight: Direct installation tools like nixos-anywhere can be brittle. Building a custom, immutable image with a tool like nixos-generators is often a more robust and reliable deployment method.
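
For orientation, a flake that exposes such an image might look roughly like the sketch below. The nixpkgs branch and the ./configuration.nix path are assumptions; the attribute name digitalocean-image is chosen to match the nix build .#digitalocean-image command used in this post, and "do" is the nixos-generators format name for DigitalOcean images.

```nix
{
  description = "Sketch: DigitalOcean image built with nixos-generators";

  inputs = {
    # Assumed channel; pin whatever branch the rest of the configuration uses.
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
    nixos-generators = {
      url = "github:nix-community/nixos-generators";
      inputs.nixpkgs.follows = "nixpkgs";
    };
  };

  outputs = { self, nixpkgs, nixos-generators, ... }: {
    # Built with: nix build .#digitalocean-image
    packages.x86_64-linux.digitalocean-image = nixos-generators.nixosGenerate {
      system = "x86_64-linux";
      format = "do";
      # Assumed location of the system configuration module.
      modules = [ ./configuration.nix ];
    };
  };
}
```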

The Iterative Debugging Loop for Custom Images

Debugging a bootable image requires a fast iteration cycle. After each configuration change, the following loop was executed:

  1. Build the image:

    nix build .#digitalocean-image
    
  2. Prepare the image for QEMU: (inside nix-shell -p qemu)

    # Copy to /tmp and decompress
    cp ./result/nixos-image-digital-ocean-*.qcow2.gz /tmp/my-nixos-image.qcow2.gz
    cd /tmp
    gunzip my-nixos-image.qcow2.gz
    # Add write permissions for QEMU
    chmod +w my-nixos-image.qcow2
    
  3. Boot in QEMU (headless with port forwarding):

    qemu-system-x86_64 \
      -m 2048 \
      -enable-kvm \
      -drive file=/tmp/my-nixos-image.qcow2,format=qcow2 \
      -netdev user,id=net0,hostfwd=tcp::2222-:22,hostfwd=tcp::4443-:443 \
      -device virtio-net-pci,netdev=net0 \
      -nographic
    

    (Note: port 4443 is forwarded for Xray testing and port 2222 for SSH access to the VM.)
    Once the VM has booted, the Xray server should be running. You can test it with:

    eget xtls/xray-core -a linux-64  # install an Xray client
    xray -c test/client-config.json  # start the Xray client
    curl --socks5-hostname localhost:1080 https://ifconfig.me
    

    If something goes wrong, move on to the next step:

  4. SSH into the VM: (from a new host terminal)

    ssh -p 2222 -i ~/.ssh/id_ed25519 root@localhost
    

    Here you can investigate, for example with systemctl status xray.
    Then update configuration.nix and clean up before the next iteration of the loop:

  5. Clean up:

    First stop the QEMU VM, either by running poweroff inside the SSH session or by pressing Ctrl-a then x in the QEMU terminal.
    Then delete the image file: rm /tmp/my-nixos-image.qcow2.

The Final Boss: Debugging the Silent Failure

The custom image was built and booted in a local QEMU VM for testing. The service was running, but it couldn’t connect to the internet. The curl test through the proxy simply hung. This began a deep dive into the service’s behavior.

  1. Is it running? We checked the service status.

    systemctl status xray
    
  2. Is it listening? We verified Xray was bound to port 443.

    sudo ss -lntp | grep 443
    
  3. Is the config correct? We inspected the generated Xray configuration.

    systemctl cat xray.service
    cat /run/credentials/xray.service/config.json # Actual path from systemctl output
    
  4. Is it DNS? A quick doggo ifconfig.me inside the VM proved DNS was working perfectly.

    doggo ifconfig.me
    
  5. Does the VM have internet? Direct curl from the VM showed outbound access was fine.

    curl https://ifconfig.me
    
  6. The strace Turning Point (IPv6): When logs are silent, you must go deeper. We attached strace to the running Xray process to watch its direct communication with the Linux kernel.

    # Identify the server process PID
    pgrep -u xray -l xray
    # Attach strace to the server PID, tracing network calls
    sudo strace -p <SERVER_PID> -e trace=network
    # In a new terminal, trigger the hang with the client
    curl --socks5-hostname localhost:1080 https://ifconfig.me
    

    The output was the smoking gun:

    connect(...) = -1 ENETUNREACH (Network is unreachable)
    

    This failure had nothing to do with sandboxing: the Xray process was trying to connect out over IPv6 on a network that only supported IPv4, and the kernel correctly reported that the network was unreachable.

Insight: When an application fails silently, strace is your most powerful tool. It bypasses application logs and shows you the ground truth of what the process is asking the kernel to do.
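
One way to rule out IPv6 on an IPv4-only network is to turn it off at the NixOS level, as sketched below. This assumes that disabling IPv6 for the whole host is acceptable; the exact change made in this deployment is not shown in the post.

```nix
{
  # Assumption: the host never needs IPv6, so it is disabled system-wide.
  networking.enableIPv6 = false;
}
```

An alternative is to keep IPv6 enabled on the host and configure the proxy's outbound to prefer IPv4 instead.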

  7. The Final strace (Internal Hang): Even after fixing IPv6, the curl still hung, and strace -e trace=network showed nothing. This indicated that the process was stalling before it even attempted a network call. A full strace revealed the truth.

    # Identify the server process PID
    pgrep -u xray -l xray
    # Attach strace to the server and its threads, tracing all syscalls to a file
    sudo strace -f -p <SERVER_PID> -e trace=all -o /tmp/xray_trace.log
    # In a new terminal, trigger the hang with the client
    curl --socks5-hostname localhost:1080 https://ifconfig.me
    # After a few seconds, stop strace and view the tail of the log
    tail -n 20 /tmp/xray_trace.log
    

    This trace revealed that the process was being silently blocked by the unit's SystemCallFilter and CapabilityBoundingSet settings, which prevented it from making outbound connections.
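
The post does not show the exact change to the unit, so the following is only a sketch of how such restrictions could be loosened from configuration.nix. The specific capability and syscall group are assumptions; lib.mkForce is used so that these values replace any list the xray module may already define, rather than being merged with it.

```nix
{ lib, ... }:
{
  systemd.services.xray.serviceConfig = {
    # Assumed requirement: bind to port 443 without running as root.
    AmbientCapabilities = [ "CAP_NET_BIND_SERVICE" ];
    CapabilityBoundingSet = lib.mkForce [ "CAP_NET_BIND_SERVICE" ];
    # Assumed filter: systemd's predefined group for ordinary services,
    # which includes the socket and connect family of syscalls.
    SystemCallFilter = lib.mkForce [ "@system-service" ];
  };
}
```

After rebuilding, systemctl cat xray.service shows whether the overrides made it into the generated unit.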

The “One Last Thing” Moment: A REALITY Check

With the earlier blockers out of the way, curl still failed, but this time with a new, application-level error from the Xray client: REALITY: received real certificate. This error means the cryptographic handshake between the client and the server failed. The cause was a simple but critical human error: the server’s private key had been copied into the client’s public key field. Generating the correct public key from the private key and placing it in the client config finally solved the issue.

Insight: Always double-check your cryptographic inputs. A privateKey and publicKey are a pair, not interchangeable copies.

Conclusion

This debugging journey was a powerful lesson in methodical problem-solving. It reinforced that when deploying complex systems, the problem is rarely where you first look. By starting with the high-level infrastructure and systematically drilling down—from bootloaders to systemd services, and finally to kernel syscalls and application logic—we were able to find and fix every issue, resulting in a perfectly configured, secure, and deployable NixOS image.

P.S.: When running Zellij in Git Bash on Windows with Zellij's mouse mode enabled, selecting and copying text can be tricky: hold Shift, press the left mouse button, and drag to select. When you release the button, the selected text is copied to the clipboard automatically.
