Zero Downtime Deployment with Octopus – an easy way
With a standard scaled Sitecore infrastructure – two content delivery servers behind a load balancer without sticky sessions (which means you have to use the session database) – a zero-downtime deployment can be an easy task with Octopus. These prerequisites already show how important the infrastructure is: keeping the zero-downtime requirement in mind from the beginning of a project makes it much easier to choose the right infrastructure.
First, the load balancer has to check the health of each server. A common practice is a file that must be accessible and contain specific content such as “Server UP”. If the file is missing or does not contain the expected content, the load balancer takes the server out of rotation and the sessions are moved to the remaining server(s).
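The probe logic behind such a health check can be sketched as follows. This is an illustrative Python sketch only – a real load balancer (F5, HAProxy, etc.) runs an equivalent monitor on a schedule; the URL, file name and marker content are assumptions:

```python
from urllib.request import urlopen
from urllib.error import URLError

EXPECTED = "Server UP"  # assumed marker content of the health-check file

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the health-check file is reachable and contains the marker."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (URLError, OSError):
        return False  # file missing or server unreachable -> server goes out of rotation
    return EXPECTED in body
```

Checking for specific content (not just HTTP 200) matters: a default IIS error page or a misconfigured site can still answer 200, but it will not contain the marker.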
Deployment Process in Octopus
Second, the Octopus process has to be a “rolling deployment”. For the process step, this means that every machine in the environment is handled one after another instead of all at the same time. It is crucial that all steps of such a deployment are child steps, because the rolling-deployment setting applies to one step and its child steps. Since Octopus 3.5.7 it is possible to move steps into or out of a group; with older Octopus versions you have to rebuild the whole process.
The Octopus documentation says about rolling deployments: “In load balanced scenarios, this allows us to reduce overall downtime.” But isn’t a zero-downtime deployment, from the visitors’ perspective, possible using a rolling deployment?
A Simplified Zero-Downtime Process
All of the following steps are placed in a group, so that the rolling-deployment configuration can be set on the parent step and everything runs on one machine after another.
First, take the current machine out of the load balancer: change the content of the check file or, even easier, move or delete it using a custom PowerShell step. After that, wait some time to give the load balancer the chance to move the remaining sessions to the remaining server(s). Then do your deployment magic.
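The “take it out” step can be sketched like this. In Octopus this would typically be that custom PowerShell step running on the target machine; the Python below is only an illustration, and the file path and drain time are assumptions:

```python
import os
import time

HEALTH_FILE = r"C:\inetpub\wwwroot\healthcheck.txt"  # assumed path of the check file
DRAIN_SECONDS = 60  # assumed time the load balancer needs to drain sessions

def disable_node(health_file: str = HEALTH_FILE) -> None:
    """Rename the check file so the load balancer's probe fails and traffic stops."""
    if os.path.exists(health_file):
        os.rename(health_file, health_file + ".disabled")

def enable_node(health_file: str = HEALTH_FILE) -> None:
    """Restore the check file so the machine goes back into rotation."""
    disabled = health_file + ".disabled"
    if os.path.exists(disabled):
        os.rename(disabled, health_file)

def drain(seconds: int = DRAIN_SECONDS) -> None:
    """Give the load balancer time to detect the failing probe and move sessions."""
    time.sleep(seconds)
```

Renaming instead of deleting keeps the original file content, so bringing the node back online is just a rename in the other direction.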
I recommend verifying that the freshly deployed machine works correctly. The minimum is a website health check that makes sure the site responds with HTTP status 200, but complex frontend and functional tests are possible as well.
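Such a minimal smoke test could look like the following Python sketch; the retry count and delay are assumptions, chosen because a freshly deployed site often needs a few seconds before it answers:

```python
import time
from urllib.request import urlopen
from urllib.error import URLError, HTTPError

def smoke_test(url: str, attempts: int = 5, delay: float = 5.0) -> bool:
    """Return True once the site answers with HTTP 200; retry while it warms up."""
    for i in range(attempts):
        try:
            with urlopen(url, timeout=10) as resp:
                if resp.status == 200:
                    return True
        except (HTTPError, URLError, OSError):
            pass  # not up yet (500, connection refused, timeout, ...)
        if i < attempts - 1:
            time.sleep(delay)
    return False
```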
Be sure to call the website on the current machine directly, either by setting a hosts-file entry or by using unique hostnames for each server. Otherwise the load balancer handles the requests and they are successfully answered by the other server(s) – and your check proves nothing. This initial call might also be enough to warm up the caches, so the first visitors won’t notice that they are on a freshly deployed website.
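One way to target a single machine behind the load balancer without touching the hosts file is to send the request to the server’s own address while setting the public hostname in the Host header. A Python sketch (the address and hostname are placeholders, not real values):

```python
from urllib.request import Request, urlopen

def call_machine_directly(machine_addr: str, public_host: str, path: str = "/") -> int:
    """Request one specific server, but with the public hostname in the Host
    header, so the correct IIS binding / Sitecore site resolution is used.
    Returns the HTTP status code."""
    req = Request(f"http://{machine_addr}{path}", headers={"Host": public_host})
    with urlopen(req, timeout=10) as resp:
        return resp.status

# hypothetical usage:
# call_machine_directly("10.0.0.11", "www.example.com")
```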
If everything is all right, undo the first step: make the machine available to the load balancer’s health check again. Another pause is recommended before the process starts on the next server.
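Put together, the per-machine sequence looks roughly like this. It is only a sketch of the control flow; the helper functions stand in for the Octopus child steps described above and are assumptions, not an Octopus API:

```python
import time

def rolling_deploy(machines, disable_node, deploy, smoke_test, enable_node,
                   drain_seconds=60, settle_seconds=30):
    """Handle one machine after another, as Octopus does in a rolling deployment."""
    for machine in machines:
        disable_node(machine)          # fail the health check -> LB drains the node
        time.sleep(drain_seconds)      # give the LB time to move sessions away
        deploy(machine)                # the actual deployment steps
        if not smoke_test(machine):    # verify the machine before re-enabling it
            raise RuntimeError(f"Smoke test failed on {machine}; node stays offline")
        enable_node(machine)           # health check passes again -> back in rotation
        time.sleep(settle_seconds)     # pause before touching the next server
```

Note the failure behavior: if the smoke test fails, the loop stops with the broken machine still out of rotation, so visitors keep being served by the remaining server(s).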
Keep in mind that during a rolling deployment, old and new code run in parallel for a while. For example, in Sitecore you might change your implementation in a way that requires new items in the database. Be sure that your code still works when those items aren’t there yet.
The biggest problem might be changes to existing fields (e.g. field types), where the old implementation already reads the changed field and crashes. Keeping this in mind might be essential throughout the whole application lifecycle.
A failed deployment on one machine isn’t a big problem, because all visitors get the website from the other server(s). In the worst case, you can deploy the previous release to that server and bring it back online.