On Building a Blog - Build Pipeline

My blog, like most of my personal projects, is hosted on GitLab. I love pretty much everything about GitLab - the product, the service, the company - so I use their services a lot. You can self-host a GitLab server if you’re so inclined, but I personally use their cloud offering at gitlab.com.

The fundamental service GitLab provides is a central Git repository for syncing your repos between machines, users, etc., but it has grown to be so much more than that. One of the really cool things about their service is their integrated continuous integration tools.

GitLab CI

GitLab CI/CD allows you to include a file in your repo that defines a sequence of stages (a pipeline) that need to be completed to build/test/deploy your code, and the details of what each of those stages needs to do.

The build pipeline for my blog would have to be one of the simplest you can have, and it will serve as a great reference for the rest of the article if I just include it up-front:

image: cdbax/hugo_aws-cli:latest

stages:
  - build
  - deploy

hugo build:
  stage: build
  only: 
    - master
  variables:
    GIT_SUBMODULE_STRATEGY: recursive
  cache:
    key: ${CI_COMMIT_SHORT_SHA}-dist
    paths:
      - public
    policy: push
  script: 
    - HUGO_ENV=production hugo -v -s ./
  tags:
    - private

deploy to S3:
  stage: deploy
  only:
    - master
  variables:
    GIT_STRATEGY: none
    GIT_SUBMODULE_STRATEGY: none
  cache:
    key: ${CI_COMMIT_SHORT_SHA}-dist
    paths:
      - public
    policy: pull
  script: 
    - aws s3 sync public s3://cbax.tech/ --sse AES256
    - aws cloudfront create-invalidation --distribution-id ${CLOUDFRONT_DISTRO} --paths "/*"
  tags:
    - private

Before I talk about the script, it’s important to understand how it will be used. When I push my code up to GitLab, their service will see this file, and pass instructions to a “build runner”. That build runner will then look at the instructions, and run each defined job inside a Docker container. The container it will use is also defined in the script.

Let’s quickly break down the CI script:

  1. image - Use this specific Docker image for running the jobs.
  2. stages - There are two stages, with these names, to be run in this order.
  3. hugo build - This is the job definition for building my site.
  4. deploy to S3 - This is the job definition for deploying my site.

The Docker Image

The site I’m building has a couple of key dependencies:

  • It needs Hugo to turn the code into files that make up a website
  • It needs the AWS-CLI to push the files over to S3 securely

With a bit of quick Googling I found a Docker image that was very close to what I wanted, but not quite right. The image I found included htmlproofer in order to run tests over the built site, and as a result it used a Ruby image as its base. I didn’t want htmlproofer, and therefore I didn’t need Ruby. I also DID need the AWS-CLI, which wasn’t part of this image.

The solution was to build my own image, but having an existing one to copy made it a hell of a lot easier. The image I found was based on a Ruby-Alpine image, which was great, as I planned to use Alpine as my base. I could just copy all the commands over, and I’d have an image that could run Hugo.

Installing the AWS-CLI can be a pain, but I found an Alpine package in the Edge repository, so I was able to get away with a one-liner:

# Install AWS-CLI from Alpine's edge repository
RUN apk add aws-cli --no-cache --repository http://dl-3.alpinelinux.org/alpine/edge/testing/ --allow-untrusted

The fact that it’s not in a main repo, and the need for --allow-untrusted, would concern me and warrant further investigation if this were anything other than a blog…but I won’t lose any sleep over the security of this particular pipeline.
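Putting those pieces together, the whole Dockerfile ends up being quite short. The sketch below is illustrative rather than a copy of my actual image - in particular, the Hugo version number and release URL are assumptions you’d pin to whatever release you actually need:

```dockerfile
# A minimal sketch: Alpine base, Hugo binary, AWS-CLI.
# HUGO_VERSION and the download URL are illustrative - pin your own release.
FROM alpine:3.10

ENV HUGO_VERSION=0.55.6

# Fetch the Hugo release binary and drop it on the PATH
RUN apk add --no-cache ca-certificates wget \
    && wget -q "https://github.com/gohugoio/hugo/releases/download/v${HUGO_VERSION}/hugo_${HUGO_VERSION}_Linux-64bit.tar.gz" \
    && tar -xzf "hugo_${HUGO_VERSION}_Linux-64bit.tar.gz" -C /usr/local/bin hugo \
    && rm "hugo_${HUGO_VERSION}_Linux-64bit.tar.gz"

# Install AWS-CLI from Alpine's edge repository
RUN apk add aws-cli --no-cache --repository http://dl-3.alpinelinux.org/alpine/edge/testing/ --allow-untrusted
```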

The Build Jobs

The “hugo build” job does exactly that. It only runs when changes are made to the master branch, which is 😱 the only branch I’m using. I’m in the habit of specifying this in my CI scripts, but I guess in this case it could allow me to use a different branch if I was committing experiments that I didn’t want built. You’ll also note the “private” tag on each of the jobs. This restricts the jobs to runners I’ve tagged “private”. GitLab has a heap of public shared runners available, but I prefer to only run my build jobs on runners I control.

One of the problems I ran into when I first built my site was that it wasn’t including my Hugo theme. I’ve added my theme as a Git submodule, as I can then update it easily if changes are made on the theme’s repository. It turns out that GitLab CI, by default, won’t pull down submodules. That’s what the GIT_SUBMODULE_STRATEGY variable does: it tells the runner to pull down any submodules, including any submodules inside submodules.
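As a rough illustration of what that setting saves the runner from doing by hand, here’s the submodule workflow played out with throwaway local repos - the paths and the theme name are made up for the example:

```shell
# Work in a scratch directory (illustrative only)
cd "$(mktemp -d)"

# A stand-in for the theme's own repository
git init -q -b main theme
git -C theme -c user.email=me@example.com -c user.name=me \
    commit -q --allow-empty -m "theme initial commit"

# The blog repository, with the theme pulled in as a submodule
git init -q -b main blog
cd blog
git -c protocol.file.allow=always submodule add -q "$PWD/../theme" themes/mytheme

# GIT_SUBMODULE_STRATEGY: recursive tells the runner to do roughly this
# after cloning, so the theme's files actually exist at build time:
git -c protocol.file.allow=always submodule update --init --recursive
```

(The `protocol.file.allow` override is only needed because the example clones from a local path; a real theme would come over HTTPS.)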

The cache is how you share files between different stages in your pipeline. In this case, Hugo will spit the files out into the “public” directory, so I cache that. I also use the commit hash in the cache key, as all jobs in one pipeline will have that hash, but other pipelines won’t. I want my Build stage to push to the cache, and my Deploy stage to pull from the cache.

The key difference in the Deploy stage is the GIT_STRATEGY variable. Given that I just want this stage to push the files I’ve already built up to S3, and given they’re already in the cache, I don’t actually want this stage to pull down the Git repo at all. That’s what this variable does.

One last thing worth mentioning is the Environment Variables - or specifically where they’re coming from. You can define environment variables in your GitLab project, and have them passed to the runner when it starts a job. In this case I’ve manually defined a “CLOUDFRONT_DISTRO” variable, and there are a couple of others not shown in the script that the AWS-CLI uses for authentication. The “CI_COMMIT_SHORT_SHA” one used in the cache keys is actually one of a list of predefined variables that GitLab makes available to each job by default.
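To make that concrete, here’s roughly what the environment looks like from the job’s point of view. Every value below is a fake placeholder - the real ones live in GitLab’s CI/CD variable settings, never in the repo - but the variable names are the standard ones the AWS-CLI reads for authentication:

```shell
# Defined manually in the GitLab project's CI/CD settings (fake values):
export AWS_ACCESS_KEY_ID="AKIAEXAMPLEKEYID"
export AWS_SECRET_ACCESS_KEY="example-secret-value"
export CLOUDFRONT_DISTRO="E2EXAMPLEDISTID"

# Provided automatically by GitLab for every job (example value):
export CI_COMMIT_SHORT_SHA="1ecfd275"

# The job script can then reference them directly, with no credential
# flags needed on the aws commands themselves:
echo "Cache key would be: ${CI_COMMIT_SHORT_SHA}-dist"
```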

Thanks for reading. Toodles!