The next part in this series of posts on how I overhauled the management of my Home and Personal infrastructure is on the changes to my Salt environment.

As mentioned in the last post, I decided to split my use of Ansible and Salt into: -

Ansible

  • Bootstrapping nodes with the Saltstack agent and my SSH keys
  • Taking a node’s SSH key and adding it to Gitea/Gitlab/GitHub
  • Managing DNS and DHCP for hosts in my network (pools, static reservations, DNS entries etc)
  • Adding certain hosts to Cloudflare DNS so that I can use Let’s Encrypt on my applications
  • Adding IPs/descriptions to Netbox for IPAM

Salt

  • Pretty much everything else!

Ansible is great as an early-stage/node preparation tool, as it doesn’t rely on any agents being installed. Once this is all done, Saltstack can take over and manage what runs on nodes that have agents installed, what Prometheus exporters (i.e. monitoring agents) run alongside the applications, and more.

This series goes through what has changed in my home infrastructure, and more importantly why it needed to change. While my Ansible setup was mostly rewritten from scratch (new repository, new roles, different variable structures etc), my Salt environment didn’t need quite as much work.

If you already use Salt, or want to, hopefully some of this will help you in improving your Salt environment.

What needed to change?

When I started using Saltstack on my home infrastructure, I was already using it in a professional capacity too. While this doesn’t guarantee that everything will be perfect, I had a solid grounding to build upon.

However I wasn’t entirely happy with what I had put together. The main issues were: -

  • Very little was managed with Salt
    • I only had Consul installation and configuration, some base packages, and LLDP installation/enablement
  • Consul pillar data (i.e. variables that applied to the hosts) was required to be defined up front for every host
  • Services would restart whenever Salt ran, even if nothing changed
  • The Pillars defining which exporters Consul would add to its service list had to be written by hand
  • Consul binaries were stored in the Git repository directly, rather than sourced from Hashicorp’s repositories or releases
  • States across different operating systems (e.g. Ubuntu/Debian Linux, Alpine and OpenBSD) differed slightly, and yet were repeated in their entirety

Why did I want to change this? I wanted to move to continuous integration for my infrastructure. Also, I wanted Git to be a source of truth for how my applications were defined. If every run restarts services, and every new host needs additional configuration, it makes the process less declarative and more reactive (and impactful).

Also, as I was migrating services from Ansible to Salt, I didn’t want to inherit some of the same issues I had with the Ansible playbooks (non-repeatable tasks, locally-stored binaries, dependency issues). If I could get Salt into a good state, I would have a solid blueprint for adding more services to it.

The issues

First I’ll outline all the issues, and then I’ll go through how I approached them.

Host-specific pillars

My pillar top.sls file (the file which defines what pillars are tied to what host) had the following definition: -

base:
  '*':
    - consul.{{ grains['nodename'] }}

The asterisk (as you might be able to guess) means that this applies to every host. The {{ grains['nodename'] }} placeholder refers to the name of the host. The below shows what the nodename is for some of my hosts: -

# salt '*' grains.item nodename
archspire.noisepalace.home:
    ----------
    nodename:
        archspire
git-01.noisepalace.home:
    ----------
    nodename:
        git-01
vpn-01.noisepalace.home:
    ----------
    nodename:
        vpn-01
ns-03:
    ----------
    nodename:
        ns-03
pinkfloyd.noisepalace.home:
    ----------
    nodename:
        pinkfloyd
dev-02.noisepalace.home:
    ----------
    nodename:
        dev-02

What this means is that Salt will look for a file in the directory consul called $NODENAME.sls (e.g. vpn-01.sls, dev-02.sls). If a file does not exist for every node, Salt will throw an error, because the Pillar top.sls says that every host requires one.

This is an issue, as it means that when a brand new host is added, a file must be created that contains all the variables it needs. The point of configuration management is to configure and prepare many hosts en masse. Defining a file for every single host goes against the principles of automation!

Why did I set up the pillars like this in the first place? The reason is that I run Consul on my VPSs (for Prometheus exporter service discovery). As I don’t want to make my Consul servers publicly available, I needed to control which interface Consul would listen on (in my case, the Wireguard tunnel interfaces). There are many ways to do this, but I chose the “easiest” rather than the most scalable. Hindsight is a wonderful thing!

Hard-coding data in Pillars

If we take a look at the old Pillar data, we can see the data Consul would use for binding to a specific interface: -

# vps-shme.sls
consul:
  bind_int: wg0
  prometheus_services:
    - node_exporter
    - wireguard
    - iptables
    - apache
    - mysql
    - icmp

# ns-03.sls
consul:
  bind_int: enp6s0
  prometheus_services:
    - node_exporter
    - named
    - pihole
    - dhcp
    - icmp

The variable bind_int was used in the Consul configuration template when deploying Consul to a node. It ensured Consul would bind to the correct interface. This required defining the pillar before Salt could run against the host, otherwise the run would fail.

If we take a look at the Consul configuration template itself, we can see where this Pillar data was used: -

datacenter = "noisepalace"
data_dir = "/opt/consul"
encrypt = "$TOKEN"
retry_join = ["192.168.0.6"]
{%- if pillar['consul'] is defined %}
{%- if pillar['consul']['bind_int'] is defined %}
{%- set interface = pillar['consul']['bind_int'] %}
bind_addr = "{{ grains['ip4_interfaces'][interface][0] }}"
{%- endif %}
{%- endif %}

node_meta {
  env = "noisepalace" 
}

performance {
  raft_multiplier = 1
}

This takes the interface we supply in our Pillar data, and then derives the IP address of this interface from the host’s grains (the host facts).
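
For reference, the ip4_interfaces grain is a mapping of interface names to the IPv4 addresses assigned to them, which is why the template looks up the interface name and takes the first address. A rough sketch of its shape (interface names and addresses here are illustrative): -

ip4_interfaces:
  lo:
    - 127.0.0.1
  eth0:
    - 192.168.0.10
  wg0:
    - 172.16.0.2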

We also have a hard-coded list of what prometheus_services run on this machine. If one of the exporters is removed, it also needs to be removed from here, and if I forgot to add an exporter to the list, it would never be monitored. This variable was used in the Consul State file (states/consul/init.sls) like so: -

{% if 'prometheus_services' in pillar['consul'] %}
{% for service in pillar['consul']['prometheus_services'] %}
/etc/consul.d/{{ service }}.json:
  file.managed:
    - source: salt://consul/files/{{ service }}.json
    - user: consul
    - group: consul
    - mode: 0640
    - template: jinja
{% endfor %}
{% endif %}

consul_reload:
  cmd.run:
    - name: consul reload

For every service in the list, we would add the JSON file into the correct Consul directory, and then at the end we would reload Consul. While this task statement is okay, it is reliant on hard-coded data.

Needless restarting/reloading of services

As I didn’t use Salt for many things, or run it often, the odd service restart wasn’t an issue. However the more I moved to it, the more applications would restart without good reason. If every minor configuration change required restarting almost every application on a machine, the machines would have huge CPU/memory/disk usage spikes, as well as taking the services down.

As an example, the below reloads Consul, whether anything has changed or not: -

consul_reload:
  cmd.run:
    - name: consul reload

Also, a lot of the Ansible playbooks I wanted to migrate (particularly those for exporter installation) would restart services (Prometheus, exporters and more) every time they ran, e.g.: -

- name: Daemon Reload and run prometheus
  systemd:
    state: restarted
    enabled: yes
    daemon_reload: yes
    name: prometheus

Binaries stored in the repository

Storing binaries in Git (unless you are using something like Git LFS) is not recommended. The problems with storing binaries in Git (or any other version control repository) are: -

  • Git is geared towards showing changes, which is difficult and not very useful on binary files
  • The repository size is significantly larger compared to storing text/configuration/code
  • The version of the binary is static until you download a new version and commit it to the repository

Bear in mind that Git stores all versions of objects committed. If over time you have multiple versions of a binary (even if they have the same name) in the history, a clone of the repository in future would include all versions of that binary too.

The following is how Consul was installed previously: -

/usr/local/bin/consul:
  file.managed:
    {% if grains['os'] != 'Alpine' and grains['os'] != 'OpenBSD' and grains['cpuarch'] == 'x86_64' %}
    - source: salt://consul/files/consul_amd64
    {% elif grains['os'] != 'Alpine' and grains['os'] != 'OpenBSD' and 'arm' in grains['cpuarch'] %}
    - source: salt://consul/files/consul_arm
    {% endif %}
    - user: root
    - group: root
    - mode: 755

Here there is a version of Consul for x86-64 machines, and one for arm, both stored in the repository.

Similar tasks repeated for different systems

In my network I run a mixture of Debian, Ubuntu, Alpine, OpenBSD and a single Arch Linux machine (mainly for development/testing).

Debian and Ubuntu are similar enough that most states for Debian work with Ubuntu. However Arch, Alpine and especially OpenBSD differ significantly. Arch uses a different package manager (and different locations for some binaries), Alpine uses a different init system and service manager, and OpenBSD is an entirely different operating system altogether.

To cater to these, many states were repeated but with minor differences. For example: -

{% if grains['os'] != 'OpenBSD' %}
/opt/consul:
  file.directory:
    - user: consul
    - group: consul
    - mode: 755
    - makedirs: True

/etc/consul.d:
  file.directory:
    - user: consul
    - group: consul
    - mode: 755
    - makedirs: True

[...]
{% if grains['os'] == 'OpenBSD' %}
/opt/consul:
  file.directory:
    - user: _consul
    - group: _consul
    - mode: 755
    - makedirs: True

/etc/consul.d:
  file.directory:
    - user: _consul
    - group: _consul
    - mode: 755
    - makedirs: True
[...]

The only difference in the above is that the user and group are prefixed by an underscore (_) on OpenBSD, and yet we have entirely separate tasks to create directories and add files.

If I needed to change the directory that Consul stores data in, or its configuration, I would have had to change it in multiple places. This often led to forgetting to change it everywhere and breaking at least one of the systems.

The Changes

While the above issues were significant, they didn’t require a complete refactor (unlike my Ansible repository).

Host-specific pillars

I wanted to move away from host-specific pillars for every machine. This doesn’t mean that hosts cannot have pillars that are specific to them, just that by default they are only required if they differ from the defaults. Having sensible defaults rather than no defaults can dramatically reduce complexity and manual configuration work.

I approached this by making better use of custom grains.

Grains are like facts in Ansible. Salt derives the facts from the host, rather than defining them for the host. For example, the CPU architecture of a machine is derived from the host, as is what OS it is running.
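
Custom grains can be set statically in /etc/salt/grains on a minion, or under a grains key in the minion configuration file, which is the approach I take below. A rough sketch of the static file option, with illustrative values: -

# /etc/salt/grains
bind_int: wg0
groups:
  - vmh-guests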

As mentioned already, I use Ansible to bootstrap hosts and to deploy the Salt Minion application. A few minor changes in my Ansible code would help with solving this issue.

Previously, the Salt Minion configuration template looked like the below: -

master: salt-master.noisepalace.home
id: {{ ansible_fqdn }}
nodename: {{ ansible_hostname }}
grains:
  nodename: {{ ansible_hostname }}

We had a single custom grain here (which overrides the discovered nodename grain), but nothing more. The below shows the changes I made to the template: -

master: salt-master.noisepalace.home
id: {{ ansible_fqdn }}
nodename: {{ ansible_hostname }}
grains:
  nodename: {{ ansible_hostname }}
{% if 'vps' in ansible_hostname %}
  bind_int: wg0
{% else %}
  bind_int: {{ ansible_default_ipv4['interface'] }}
{% endif %}
  groups:
{% for group in group_names %}
    - {{ group }}
{% endfor -%}

I added the bind_int grain, and also the groups grain. The bind_int grain will contain the primary interface on a machine (i.e. the one which has a default route) unless it is one of my VPSs. If it is a VPS, the bind_int grain will always be the Wireguard tunnel interface (wg0).

I then updated the Consul configuration template to remove the reliance on Pillar data: -

datacenter = "noisepalace"
data_dir = "/opt/consul"
encrypt = "$ENCRYPTION_TOKEN"
node_name = "{{ grains['nodename'] }}"
retry_join = ["192.168.0.6"]
{%- set interface = grains['bind_int'] %}
bind_addr = "{{ grains['ip4_interfaces'][interface][0] }}"

node_meta {
  env = "noisepalace"
}

performance {
  raft_multiplier = 1
}

The Consul pillar data for each host is no longer required for it to be installed and configured correctly.

We still need to do something about the Prometheus services that are registered with Consul, but I will cover that in a later section.

In addition, I added the Ansible groups that each host is in as grains. This means that I can define a host as having a role, and then install packages/states/configuration based upon the role. An example of this is below: -

base:
[...]
  'groups:vmh':
    - match: grain
    - vmh
    - exporters.libvirt_exporter
    - exporters.syncthing_exporter
    - docker
    - syncthing

The above says that if a host is in the vmh group, we will install the vmh states (which installed KVM/QEMU and related services), Docker, Syncthing (for syncing ISOs between hosts) and related exporters for monitoring.

I also use this to say whether the qemu-guest-agent package should be installed on a machine, based upon it being in the vmh-guests group: -

base_packages:
  pkg.installed:
  - pkgs:
    - lldpd
    - jq
    - rsync
    - wget
[...]
{% if 'vmh-guests' in grains['groups'] %}
    - qemu-guest-agent
{% endif %}
[...]

This means the qemu-guest-agent package is not installed on the hypervisors and VPSs, only those that need it (i.e. my virtual machines).

Hard-coding data in pillars

I am no longer hard-coding which interface Consul will bind to (using custom grains). The next improvement is to remove the hard-coded list of exporters to register with Consul (for Prometheus to discover and monitor).

Part of the issue previously was that, because all of the exporters were installed by Ansible, Salt did not know what was already installed on a machine. However now that the exporters are deployed with Salt, we can also include the relevant file to register them with Consul: -

## Wireguard exporter 
[...]
/etc/consul.d/wireguard.json:
  file.managed:
    - source: salt://consul/files/wireguard.json
    - user: consul
    - group: consul
    - mode: 0640
    - template: jinja

consul_reload_wireguard:
  cmd.run:
    - name: consul reload
    - onchanges:
      - file: /etc/consul.d/wireguard.json

## Pihole exporter 
[...]
/etc/consul.d/pihole.json:
  file.managed:
    - source: salt://consul/files/pihole.json
    - user: consul
    - group: consul
    - mode: 0640
    - template: jinja

consul_reload_pihole:
  cmd.run:
    - name: consul reload
    - onchanges:
      - file: /etc/consul.d/pihole.json

These types of tasks are included in the State file for each exporter. If an exporter is installed, it gets registered with Consul automatically. This removes the reliance on Pillar data for the services Prometheus needs to monitor, meaning no Consul pillar needs to be defined for every host.

This isn’t to say that no Consul pillar data exists at all though. Some applications already have a metrics/monitoring endpoint (i.e. no exporters are required) that exposes metrics in the correct Prometheus format. In this case, we still supply a Consul pillar like so: -

## git-01.sls
consul:
  prometheus_services:
    - gitea

## pinkfloyd.sls
consul:
  prometheus_services:
    - traefik
    - alertmanager

However unlike before, this doesn’t cover every possible Prometheus service, just the additional ones. To show what services are running on pinkfloyd, for example: -

$ consul catalog services -node pinkfloyd
alertmanager
cadvisor
consul
icmp
libvirt
mikrotik_exporter
nextcloud
nginx
node_exporter
pihole
redis
speedtest
syncthing
traefik
wireguard

All of these except traefik and alertmanager are registered without any Pillar data.

Restarting/reloading of services

To avoid restarting/reloading services when nothing has changed, I used three different directives in Salt: -

  • onchanges - This will run a task only if a prerequisite task has changed
  • watch - This is very similar to onchanges, except that for certain kinds of tasks additional actions may be taken
  • require - This only requires that a task completed successfully, as opposed to a previous task completing and making changes

A good example of require is ensuring a package has been installed before trying to configure it.

If a package fails to install, then the files it would provide (e.g. service files, configuration, directories) will not exist. If a subsequent task tries to put additional files into the directory that the package installation would have created, it will also fail. With the require keyword, we ensure that tasks with hard dependencies do not run without their prerequisites completing first.
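
As a sketch of that pattern (the package is real, but the configuration file and source path are assumptions for illustration): -

lldpd_package:
  pkg.installed:
    - name: lldpd

/etc/lldpd.conf:
  file.managed:
    - source: salt://lldp/files/lldpd.conf
    - require:
      - pkg: lldpd_package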

The onchanges directive means that the task will only run if a prerequisite task made any changes. This is useful for things like reloading Consul if a file was added/changed in its configuration directory, or performing a systemctl daemon-reload if the SystemD unit file changes.
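
A minimal sketch of the daemon-reload case, assuming the unit file is managed by a file.managed task (the source path is an assumption): -

/etc/systemd/system/prometheus.service:
  file.managed:
    - source: salt://prometheus/files/prometheus.service
    - template: jinja

prometheus_daemon_reload:
  cmd.run:
    - name: systemctl daemon-reload
    - onchanges:
      - file: /etc/systemd/system/prometheus.service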

A good example with both require and onchanges is below: -

test_prometheus_config:
  cmd.run:
    - name: "/usr/local/bin/promtool check config /etc/prometheus/prometheus.yml"
    - onchanges:
      - file: /etc/prometheus/prometheus.yml
      - file: add_prometheus_alerts

prometheus_reload:
  cmd.run:
    - name: curl -X POST localhost:9090/-/reload
    - onchanges:
      - file: /etc/prometheus/prometheus.yml
      - file: add_prometheus_alerts
    - require:
      - test_prometheus_config

In the above, we use promtool (a Prometheus utility CLI tool) to check that our Prometheus configuration is correct. The promtool command will also check Alertmanager rules by default. This uses onchanges to make sure it only runs if there have been updates to the /etc/prometheus/prometheus.yml task or the add_prometheus_alerts task.

If the test_prometheus_config task runs without errors, then we trigger a reload of Prometheus via the API. If the test_prometheus_config task fails, Prometheus does not reload. If neither the Prometheus configuration or Alertmanager rules have changed, no reload takes place.

This not only ensures that Prometheus only reloads when its configuration files change, but has the added benefit of making sure they are syntactically correct.

A quick point to note is that Salt tasks can have custom names (e.g. add_prometheus_alerts) or the name can be the file/service being changed (e.g. /etc/prometheus/prometheus.yml). The onchanges directive must refer to the name of the task, whether that is a custom name, or matches the file/service being acted upon.
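
For example, the add_prometheus_alerts task referenced above could be a file.managed task with a custom name, which the onchanges requisite then refers to by that name (the alert file paths here are assumptions): -

add_prometheus_alerts:
  file.managed:
    - name: /etc/prometheus/alerts.yml
    - source: salt://prometheus/files/alerts.yml
    - template: jinja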

watch versus onchanges

The onchanges directive instructs Salt to run the task if a preceding task changed. In the above we can see that this runs commands on the machines in question to test/reload Prometheus.

Take the below as an example: -

prometheus_service:
  service.running:
    - name: prometheus
    - enable: True
    - onchanges:
      - file: /etc/systemd/system/prometheus.service

When Prometheus is installed, this will ensure the service is running and is enabled. If the service file changes though, the task doesn’t do anything because the service is still running (unless another process/user stopped it) and is still enabled.

What watch does, for certain types of tasks, is perform additional actions. If we change the above to this: -

prometheus_service:
  service.running:
    - name: prometheus
    - enable: True
    - watch:
      - file: /etc/systemd/system/prometheus.service

Salt will now perform the additional action of restarting the service if the file it is watching changes. Not every type of task has additional watch behaviour, but in this case it helps to cut down on the number of tasks we need to define.
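
The same relationship can also be declared from the other side using watch_in on the file task, rather than watch on the service (a sketch reusing the task names above; the source path is an assumption): -

/etc/systemd/system/prometheus.service:
  file.managed:
    - source: salt://prometheus/files/prometheus.service
    - watch_in:
      - service: prometheus_service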

Binaries stored in the repository

This is a pretty straightforward change to make, especially for Consul. Hashicorp now make a lot of their products available via Apt and RPM repositories. This means all my Debian and Ubuntu machines can now use the repositories, and get updates when they are available (not when I remember to update the binary in my Git repository).

This doesn’t help with my OpenBSD, Alpine and Arch machines. For them, I use what is in their package repositories. All three do keep relatively up-to-date packages for Consul, whereas the versions in the Debian/Ubuntu archives are usually months/years old.

To move to this approach, we use: -

{% if grains['os_family'] == 'Debian' %}
consul_repo:
  pkgrepo.managed:
    - humanname: Hashicorp
    - name: "deb https://apt.releases.hashicorp.com {{ grains['oscodename'] }} main"
    - dist: {{ grains['oscodename'] }}
    - file: /etc/apt/sources.list.d/hashicorp.list
    - gpgcheck: 1
    - key_url: https://apt.releases.hashicorp.com/gpg
{% endif %}

{% if grains['os_family'] == 'RedHat'  and grains['os'] != 'Fedora' %}
consul_repo:
  pkgrepo.managed:
    - humanname: Hashicorp
    - file: /etc/yum.repos.d/hashicorp.repo
    - baseurl: https://rpm.releases.hashicorp.com/RHEL/$releasever/$basearch/stable
    - gpgcheck: 1
    - gpgkey: https://rpm.releases.hashicorp.com/gpg
{% endif %}

{% if grains['os_family'] == 'RedHat' and grains['os'] == 'Fedora' %}
consul_repo:
  pkgrepo.managed:
    - humanname: Hashicorp
    - file: /etc/yum.repos.d/hashicorp.repo
    - baseurl: https://rpm.releases.hashicorp.com/fedora/$releasever/$basearch/stable
    - gpgcheck: 1
    - gpgkey: https://rpm.releases.hashicorp.com/gpg
{% endif %}

{% if grains['os_family'] == 'Suse' %}
Virtualization_containers:
  pkgrepo.managed:
    - humanname: "Virtualization:containers (openSUSE_Tumbleweed_and_d_l_g)"
    - baseurl: https://download.opensuse.org/repositories/Virtualization:/containers/openSUSE_Tumbleweed_and_d_l_g/
    - gpgcheck: 1
    - gpgkey: https://download.opensuse.org/repositories/Virtualization:/containers/openSUSE_Tumbleweed_and_d_l_g/repodata/repomd.xml.key
    - autorefresh: 0
    - gpgautoimport: True
{% endif %}

consul_package:
  pkg.installed:
  - pkgs:
    - consul
  - refresh: True

[...]

As you can see, I also have declarations for RHEL-based machines and SuSE. While I don’t currently run them in my infrastructure, it is worth having these for when I do need them. I wanted to be as agnostic as I can be about what systems I run in my infrastructure, allowing me to choose the best system for each role.

The last task referenced above installs Consul, and refreshes the package repositories so that it will use the latest package available. This ensures that we install from the Hashicorp repository after it is added (to the systems that repositories exist for). This will also use the native package installer on each system, whether that is Apt, DNF, Pacman, Zypper, APK, pkg_add or anything else.

In the cases where official package repositories are not available (e.g. most exporters, Prometheus and more), we can use this approach instead: -

{% if 'x86_64' in grains['cpuarch'] %}
retrieve_bind_exporter:
  cmd.run:
    - name: wget -O /tmp/bind_exporter.tar.gz https://github.com/prometheus-community/bind_exporter/releases/download/v{{ pillar['bind_exporter']['version'] }}/bind_exporter-{{ pillar['bind_exporter']['version'] }}.linux-amd64.tar.gz
{% endif %}
{% if 'aarch64' in grains['cpuarch'] %}
retrieve_bind_exporter:
  cmd.run:
    - name: wget -O /tmp/bind_exporter.tar.gz https://github.com/prometheus-community/bind_exporter/releases/download/v{{ pillar['bind_exporter']['version'] }}/bind_exporter-{{ pillar['bind_exporter']['version'] }}.linux-arm64.tar.gz
{% endif %}

extract_bind_exporter:
  archive.extracted:
    - name: /tmp
    - enforce_toplevel: false
    - source: /tmp/bind_exporter.tar.gz
    - archive_format: tar
    - user: root
    - group: root

{% if 'x86_64' in grains['cpuarch'] %}
/usr/local/bin/bind_exporter:
  file.rename:
    - name: /usr/local/bin/bind_exporter
    - source: /tmp/bind_exporter-{{ pillar['bind_exporter']['version'] }}.linux-amd64/bind_exporter

delete_bind_exporter_dir:
  file.absent:
    - name: /tmp/bind_exporter-{{ pillar['bind_exporter']['version'] }}.linux-amd64
{% endif %}

{% if 'aarch64' in grains['cpuarch'] %}
/usr/local/bin/bind_exporter:
  file.rename:
    - name: /usr/local/bin/bind_exporter
    - source: /tmp/bind_exporter-{{ pillar['bind_exporter']['version'] }}.linux-arm64/bind_exporter

delete_bind_exporter_dir:
  file.absent:
    - name: /tmp/bind_exporter-{{ pillar['bind_exporter']['version'] }}.linux-arm64
{% endif %}

delete_bind_exporter_files:
  file.absent:
    - name: /tmp/bind_exporter.tar.gz

This does pin the exporters to a version rather than having them on the latest, but it does at least ensure that I am not storing an old version in a repository and using that.
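
The version referenced in the states above comes from Pillar data, so rolling out a new release only means updating the pillar. A rough sketch of that pillar’s shape (the version number is illustrative): -

bind_exporter:
  version: 0.6.1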

Similar tasks repeated for different systems

One of the issues I found with having tasks that are almost identical but with one field/option different (e.g. a different username/group) is that I would often make updates to one and not the other, and then wonder why my changes were not taking place.

Unlike Ansible, Salt renders its state files on the fly, allowing you to use Jinja2 templating syntax inside them. How does this help us? Let’s see: -

{% if grains['os'] == 'FreeBSD' %}
/usr/local/etc/consul.d/consul.hcl:
{% else %}
/etc/consul.d/consul.hcl:
{% endif %}
  file.managed:
    - source: salt://consul/files/consul.hcl
{% if grains['os'] == 'OpenBSD' %}
    - user: _consul
    - group: _consul
{% else %}
    - user: consul
    - group: consul
{% endif %}
    - mode: 0640
    - template: jinja

The first conditional evaluates based upon whether the machine running this task is a FreeBSD machine or not. FreeBSD uses a different location for user-installed binaries and configuration files (usually prefixed with /usr/local). With the conditional in place, we can use a different filename based upon the system.

After this, we also have a conditional to say that if we are on OpenBSD, the user and group for Consul is _consul. For any other system, the user and group would be consul.

Rather than potentially 3 different tasks (a standard task, FreeBSD with a differing filename, OpenBSD with different user/groups), we have a single task with some conditionals in place. Updates to this task will now apply to all machines, not just the ones I remember to update!
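
An alternative sketch of the same idea is to set the differing values once with Jinja variables at the top of the state file, and reuse them in every task: -

{% set consul_user = '_consul' if grains['os'] == 'OpenBSD' else 'consul' %}
{% set consul_etc = '/usr/local/etc/consul.d' if grains['os'] == 'FreeBSD' else '/etc/consul.d' %}

{{ consul_etc }}/consul.hcl:
  file.managed:
    - source: salt://consul/files/consul.hcl
    - user: {{ consul_user }}
    - group: {{ consul_user }}
    - mode: 0640
    - template: jinja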

What else?

So after mitigating all these issues, what else needed to be done? Move as much to Salt as possible!

As mentioned, previously I only installed and managed Consul with Salt. Now, I manage the following: -

  • alertmanager - Receive alerts from Prometheus and send them to Slack
  • apache_exporter - For monitoring Apache2
  • base - Install base utilities
  • bind_exporter - For monitoring Bind9
  • blackbox_exporter - For monitoring ICMP/HTTP(S)/TCP connections
  • cadvisor - For container monitoring
  • consul - Install Consul and any relevant services defined in Pillars
  • dhcp_exporter - For monitoring isc-dhcp-server
  • docker - Install Docker, docker-compose, and customise the Docker daemon configuration
  • gitea - Gitea version control installation and configuration
  • gitea-release - For deploying my Gitea Release Golang application to manage releases
  • iptables_exporter - For monitoring traffic usage on machines with iptables installed
  • libvirt_exporter - For monitoring KVM machines
  • lldp - For adding LLDP (Link Layer Discovery Protocol) to all my machines
  • mikrotik_exporter - For monitoring my home router
  • mysqld_exporter - For monitoring MySQL
  • nextcloud_exporter - For monitoring my Nextcloud instances
  • nfs - For enabling NFS on servers that need it
  • nginx_exporter - For monitoring Nginx
  • node_exporter - For monitoring host metrics and custom script output
  • oxidized - Adds Oxidized for backing up my network configs
  • pihole_exporter - For monitoring my PiHole servers
  • plex_exporter - For monitoring my Plex setup
  • postgres_exporter - For monitoring PostgreSQL
  • print - Manages CUPS on my Raspberry Pi print server
  • prometheus - Installs and manages Prometheus and alert rules
  • promtail - For sending logs to Loki
  • rclone - For backing up various data to Backblaze
  • redis_exporter - For monitoring my Redis instances
  • salt - For managing the Salt Master configuration
  • samba - For enabling Samba on machines that require it
  • snmp_exporter - For monitoring my other network equipment (Unifi, ZyXEL and my FS.com switch)
  • snmp - For enabling the SNMP daemon on some servers
  • speedtest_exporter - For monitoring my Internet bandwidth
  • sshkeys - For deploying my sshkeys Golang application to all my machines
  • syncthing - For deploying Syncthing to machines that require it
  • syncthing_exporter - For monitoring my Syncthing instances
  • traefik - For deploying a Traefik container to machines running Docker services
  • unpoller - For more in-depth monitoring of my Unifi APs
  • vaultsql - For deploying my VaultSQL Golang application that manages my FreeRADIUS users in MySQL (for use with 802.1x on my wireless networks)
  • vmh - For deploying KVM to machines that require it
  • wireguard - For deploying Wireguard to a number of machines that require tunnels between them
  • wireguard_exporter - For monitoring my Wireguard tunnels

I have also integrated Hashicorp’s Vault for secret/token management. This means I can avoid committing sensitive values to Git repositories. I will go into more information on this in a future post, including how to use it with Ansible, Salt, Drone and using the Go SDK.

What still needs work?

There are still a few things I want to tackle in my Salt setup, some minor, some quite major.

Some of my state files could be improved. For example, the states I use to deploy the Prometheus node_exporter need work. There are separate files (with very similar tasks) for Linux systems using SystemD, for Alpine and for OpenBSD. As shown already, these can be consolidated quite easily, but I essentially reused the code laid out in my series on deploying Prometheus and Consul with Saltstack.

Not all applications I use are managed by Salt yet. Most are, but I haven’t tackled FreeRADIUS, monitoring OpenBSD package upgrades, and a few other small pieces. These aren’t huge, but they do need to be managed.

Finally, I want to make more use of Custom Grains. The grains I have now are fine, but the more I have, the fewer static host definitions/wildcard matches are required in my top.sls files. Examples could include deploying the Salt master, deploying Print servers, what requires Wireguard tunnels and more. This would allow me to treat all my machines as having roles instead. This simplifies configuration, and makes it easy to add new hosts with similar/the same roles.

Bonus Section: Why not Salt Masterless instead of Ansible?

When I released the last post, I got a question on the Admin Admin Podcast Telegram Channel as to why I don’t use Salt Masterless instead of Ansible. This is a valid question, especially as it would mean only managing one type of codebase. I have a few reasons for this.

Firstly, the process to use Salt Masterless still requires bootstrapping Salt onto a machine.

We also need to add additional configuration to Salt on a machine to tell it to use Masterless mode (see the referenced documentation for what needs to be done).

Additionally, the Salt configuration must be available to the machine (by cloning it locally, or defining a Git source in the Salt configuration).

Finally, Salt Masterless works the opposite way to Ansible. Rather than Ansible telling multiple machines what to do, Salt Masterless requires running Salt commands directly on each host to update their configuration. If you are familiar with Puppet (and running puppet agent -t in Cron/service files periodically), this is the same sort of model.

To make this even less desirable, the machines would then need their Salt configuration replaced at the end of the bootstrap process to now use Salt in non-Masterless mode.

If anything, the work to use Masterless mode requires more steps than using non-Masterless!

Masterless mode makes a huge amount of sense in a Cloud-like environment, using something like Cloud-Init/Cloud-Config to install Salt on the machines, and then pointing Salt to Git as a source of its state files. I wouldn’t hesitate to use it in that sort of scenario.

With my infrastructure though, Ansible makes more sense. Short of adding a user with the correct privileges to the machine so that Ansible can run the tasks it needs, I don’t need to do anything else. It also fits much better into using continuous integration, as Salt Masterless would require logging in to every host and running commands. Ansible manages all the host interaction; it just needs a valid user and credentials/SSH key available.

Summary

The updates made to Salt, and turning it into (mostly) the source of truth for what runs on my infrastructure (outside of containers), have made managing my personal infrastructure a lot nicer.

My applications are up to date, their configuration is version controlled, and I can now trust that what is in Git is also what is running.

It is also now in a state that if no changes are made in my Salt code, no changes are seen when running Salt. Services aren’t restarted every time, configuration is repeatable, everything runs clean! This is exactly what I was aiming for, paving the way to using a Continuous Integration system to manage and deploy future changes.

Next post

The next post in this series will cover getting started with Drone, the different kinds of “runners”, and the Drone files that contain the actions to take. Once this is done, there will be subsequent posts on how I integrated Ansible and Salt with Drone, how I build Go binaries, and how I deploy this website that you’re reading this post on using Drone too!