Home and Personal Infrastructure Overhaul: Part 1 - Introduction :: YetiOps — A view on the tech and open source world from a hairy human

Over the past month and a half, I have overhauled most of how my home and personal infrastructure is managed, and upgraded a couple of the machines I use.

Previously, most of my personal infrastructure was managed by single-purpose Ansible Playbooks and a minimal set of Salt states. When I first got back into Ansible (mid-to-late 2019) after a break of about 3-4 years, I didn’t have a great understanding of Ansible Roles.

These Playbooks and States served their purpose for quite a while. After having managed DNS/DHCP manually (including SQL database manipulation with PowerDNS) and having had no monitoring at all, this was a massive step up and made managing my infrastructure much easier.

Why change?

So what changed? What problems did I still want to solve?

Firstly, a lot of my personal infrastructure was still managed ad-hoc (i.e. logging in to a machine and installing or updating configuration by hand). This covered everything from deploying SQL databases to managing IP ranges, VPN configuration, and more.

Also, despite using Ansible and Salt, most of the Playbooks and States were not idempotent: they would work fine on the first run, but a subsequent run would fail.

Finally, the onboard network interface on one of my home servers (Pink Floyd) started to drop out frequently. This made me paranoid that if one of my home servers failed, I would struggle to restore a replacement to the previous state.

All in all, Ansible had helped in solving my biggest pain points, but I couldn’t rely on it if something went wrong.

Rethinking the approach

As happens quite often in tech, a failure led to improvements. Before it became apparent that the network interface on Pink Floyd was failing, I had assumed that I was reaching the limit of the server’s resources. I’d recently added a few virtual machines (mostly for development testing and playing with eBPF), and soon after I started seeing frequent dropouts in my monitoring.

This prompted me to make a few changes: -

Change the scrape_interval for some intensive/long-running checks on my infrastructure
Remove a few virtual machines
Move some applications/machines to my other home server (Archspire)
Use containers more where possible
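
As a rough illustration of the first change, and assuming a Prometheus-style configuration (the job names, targets and intervals below are placeholders, not values from the setup described here), scrape_interval can be overridden per job so that only the expensive checks run less often:

```yaml
# prometheus.yml (sketch; job names, targets and intervals are placeholders)
global:
  scrape_interval: 15s        # default applied to every job

scrape_configs:
  # Cheap checks keep the global default
  - job_name: node
    static_configs:
      - targets: ['server1.example.com:9100']

  # Intensive/long-running checks are scraped less often
  - job_name: slow_checks
    scrape_interval: 5m       # per-job override of the global value
    scrape_timeout: 1m        # must not exceed scrape_interval
    static_configs:
      - targets: ['server1.example.com:9999']
```

Overriding the interval per job avoids slowing down every other check just to accommodate a few heavy ones.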

As part of this, I removed a self-hosted Gitlab instance. I originally intended to use Gitlab CI on my infrastructure, but struggled to dedicate the time to it.

I moved all the repositories from the Gitlab instance to my Gitea instance and removed the Gitlab machine. I still wanted to move to using a CI solution, but using the in-built Gitlab approach was no longer an option.

Upgrades

After a little more investigation (and discovering the network interface was at fault), I added a USB NIC to Pink Floyd and it has been running fine ever since. The USB NIC also supports 2.5GbE, so once I buy one or more switches that support multi-gig, I can take advantage of the extra bandwidth. For now, this just means I have a functioning interface (which is a big upgrade from an intermittently functioning interface!).

After moving a few applications and machines over to Archspire (and seeing the available memory/disk space dwindle), I decided it was time to give the machine a bit of a boost. I bought two 16GB DDR4 SO-DIMM sticks of RAM and a 512GB NVMe drive, doubling both the memory and storage on this machine. This also helped solve some issues with a few applications/machines running out of memory (e.g. a Unifi Controller container, the Salt Master server), always a bonus!

Finally, I added two more VPSs for serving external services (i.e. this website, RSS, read-it-later, external monitoring). In addition to my existing Digital Ocean droplet, I now also run two Hetzner Hcloud instances.

Gitea and CI

For those who haven’t used Gitea, it is a self-hosted Git forge similar to GitHub, Gitlab and BitBucket. It is written in Go, and the resource usage is much lower than the self-hosted versions of GitHub, Gitlab or BitBucket. It started as a fork of Gogs but now appears to be more popular.

As already mentioned, I wanted to start using continuous integration with my infrastructure. Now that I had removed my Gitlab instance, I couldn’t use Gitlab CI. Gitea itself doesn’t have continuous integration support out of the box. This is fine if all you need is Git repository hosting, but I wanted to start making use of CI because: -

I wanted to lint and test my Ansible and Salt configuration before applying it
I now write a lot of tools in Go. I want to generate releases I can use across all of my machines (rather than building manually and SCP/rsync-ing them around), as well as lint and test them
I want to take the GitOps approach, where updating a repository makes the changes happen automatically, making Git the source of truth (and not the infrastructure)
- It is much easier to update a line in a file and commit it than it is to log in to one or more machines and run a series of commands

It also needed to be self-hosted (because this is all running on a local Git instance, not a publicly available one), and ideally quite low in resource usage.

After some searching around (and finding the awesome-gitea repository), I found Drone.

Drone

Drone is a CI/CD platform that uses YAML to define the steps to take when code is committed into your repository. It works in a similar way to the previously mentioned Gitlab CI and GitHub Actions, while not being tied to either platform. It integrates natively with Gitea, and can run jobs in different ways: -

In containers (e.g. Docker)
- For most jobs this makes sense (e.g. building Go tools, running Ansible Playbooks)
Using the exec runner to run commands directly on a machine (for when running in a container isn’t viable)
- A good example of the use case is with Salt, as the Salt Master can tell all Minions to update
Using the ssh runner to run commands on a machine available via SSH (for hosts that do not support containers and can’t run the exec runner).
- Running commands on something like OpenBSD, or a network device, where the agents do not run correctly (or at all in the case of network gear)
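
The two non-container options above might look roughly like the following. This is only a sketch: the pipeline names, host, user and secret name are placeholders, and it assumes the exec runner is installed on the Salt Master itself.

```yaml
# Sketch of non-container pipelines; names, hosts and secret names are placeholders.
kind: pipeline
type: exec            # runs directly on the machine hosting the exec runner
name: salt-apply

platform:
  os: linux
  arch: amd64

steps:
  - name: apply-states
    commands:
      # Assumes this runs on the Salt Master, which tells all Minions to update
      - salt '*' state.apply

---
kind: pipeline
type: ssh             # runs commands on a remote host over SSH
name: openbsd-tasks

server:
  host: openbsd.example.com
  user: drone
  ssh_key:
    from_secret: ssh_key    # private key stored as a Drone secret

steps:
  - name: check
    commands:
      - uname -a
```

Both pipelines can live in the same .drone.yml, separated by the YAML document marker, alongside any Docker pipelines.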

There is an enterprise version of Drone that supports ephemeral agents (i.e. the agents spin up when required, rather than running at all times), clustering and more. I don’t require any of these features in my infrastructure, so the OSS version is more than enough for my needs.

Each repository that you want to run tasks on requires a .drone.yml file, with a format quite similar to Gitlab CI/GitHub Actions/many other CI solutions: -

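kind: pipeline
name: default
type: docker

# Only run this pipeline for events targeting the main branch
trigger:
  branch:
    - main

steps:
  # Fetch any Git submodules before later steps run
  - name: submodules
    image: alpine/git
    commands:
      - git submodule update --init --recursive

  # Spellcheck the Markdown content, but only on pull requests
  - name: spellcheck
    image: tmaier/markdown-spellcheck
    commands:
      - mdspell --ignore-numbers --ignore-acronyms --report "content/**/*.md"
    when:
      event:
        - pull_request
[...]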

Each step will run when an event triggers it. For example, you could have a step which performs linting, but only when a pull request is raised. You could also have it deploy a version of your code to a staging/testing environment before it is merged into your main branch, and then roll out to production after it is merged.
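
A minimal sketch of that lint, staging and production flow is shown here. The step names, image and the deploy.sh script are hypothetical stand-ins rather than anything from a real repository; only the when conditions are the point.

```yaml
# Sketch only: step names, image and scripts/deploy.sh are hypothetical.
kind: pipeline
type: docker
name: lint-and-deploy

steps:
  # Lint only when a pull request is raised or updated
  - name: lint
    image: alpine
    commands:
      - ./scripts/lint.sh
    when:
      event:
        - pull_request

  # Deploy the proposed change to staging before it is merged
  - name: deploy-staging
    image: alpine
    commands:
      - ./scripts/deploy.sh staging
    when:
      event:
        - pull_request

  # Merging creates a push to main, which rolls out to production
  - name: deploy-production
    image: alpine
    commands:
      - ./scripts/deploy.sh production
    when:
      event:
        - push
      branch:
        - main
```

Steps whose when conditions do not match the current event are skipped, so one file can describe both the pre-merge and post-merge behaviour.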

Drone also has a number of plugins so that you don’t need to define every step yourself. For example, it already has plugins for Ansible and Hugo, as well as sending notifications to a number of platforms (Slack, Matrix, Telegram etc).

Configuration management

For all of my Ansible and Salt configuration to be ready for use with a CI platform (Drone in my case), it needed a huge refactor. The requirements were: -

Drone (or any other CI tool or automated process) can run the Playbooks and States
Ansible must use roles
The CI job must run against all hosts and with all the necessary roles
- This ensures all necessary changes go out at once, not when I remember to run them!
Ansible and Salt must be idempotent (i.e. running the Playbooks/States again does not fail or make further changes)
Tasks/actions only run when required (i.e. do not restart a service on every run, only when the configuration/version changes)
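
To illustrate the last two requirements on the Ansible side, a role can pair tasks with a handler so that a service restart only happens when its configuration actually changes. This is a sketch with a hypothetical role and service called myservice; the file paths and template name are placeholders.

```yaml
# roles/myservice/tasks/main.yml (hypothetical role; names and paths are placeholders)
- name: Install myservice
  ansible.builtin.package:
    name: myservice
    state: present            # reports no change if already installed

- name: Deploy myservice configuration
  ansible.builtin.template:
    src: myservice.conf.j2
    dest: /etc/myservice.conf
    mode: "0644"
  notify: Restart myservice   # queued only when the rendered file differs

- name: Ensure myservice is enabled and running
  ansible.builtin.service:
    name: myservice
    state: started            # starts the service if stopped, otherwise no change
    enabled: true

# roles/myservice/handlers/main.yml (a separate file in the same role)
- name: Restart myservice
  ansible.builtin.service:
    name: myservice
    state: restarted
```

Running this a second time with no changes should report nothing as changed and should not trigger the handler, which is the behaviour the requirements above describe.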

My previous posts on Salt and Ansible followed these rules; I had just never applied the same rules to my own infrastructure before! (Practice What You Preach)

Containers

I have been using and working with containers for about the past 6 years, and have run a few on my personal infrastructure previously. However, it was all very ad-hoc with no consistency, and most of the applications I run were still native packages and/or installed inside virtual machines. Examples of those which I have run on VPSs/Virtual Machines, and which could very easily be in containers, are: -

Hugo (for hosting this blog)
Read-it-later services
RSS
Netbox
Oxidized

During this overhaul, I moved all of the above services into containers, and added some other services in the process.

Now, the big question people may ask is: am I running all of this on Kubernetes? The answer is not yet. I am comfortable with Kubernetes as a platform, having deployed and managed both Kubernetes itself and applications running on it at multiple companies. However, I don’t feel like my infrastructure is quite ready for it yet (having only two home servers goes against the idea of quorum for a start!), and it would require refactoring everything about my personal infrastructure. Given how long it has taken me to find the time to fix all the current issues with my infrastructure, a full refactor isn’t on the cards just yet.

More to come

There is a lot to cover in what has changed, especially on the configuration management side. Rather than try to put this all into one huge post, the series will be split out into: -

Moving from basic Ansible Playbooks to using Roles
Improving my Salt configuration files
Managing dependencies and optional tasks/actions in both Ansible and Salt
Setting up Drone
Using Drone with Ansible and Salt
Using Drone with Go to build releases
Using Drone to run tasks on OpenBSD
Using Drone to build this website (including some useful linting/checking)
Managing secrets that can be used by Ansible, Salt and Drone
Additional services that provide benefits to my home and personal infrastructure
What else I want to do in future

A lot of these posts are not specific to my home infrastructure, and should hopefully provide benefit to anyone using Ansible, Salt, Drone, or those who just like reading about other people’s home labs (like me!).