Ops Build vs. Buy

Should you build your ops team from within, or hire from outside your organization?

Jose Sierra
Tapad Engineering

--

Defining Ops

Sure, I could have been more specific in the title. “Ops” is intentionally vague because I do not want the spirit of this article to be taken over by a DevOps vs. TechOps vs. DevSecOps discussion. There are enough articles, videos, etc. on that topic, and even more if you are in the group of people that add SRE to this space (currently, I am not).

For the purposes of this article, we will consider that in addition to implementing proper automation for platform-wide solutions, Ops carries the responsibilities of ensuring stable provisioning of monitoring infrastructure and tools throughout the SDLC pipeline.

With “Build vs. Buy”, I am referring to whether Ops as a function should be grown organically from within, or sourced externally — and where I see it going from here.

I’ve been in technology much longer than I like to admit, but certainly long enough to appreciate all the benefits that we receive when Ops is done right.

Engineers today are benefitting from the formalization/advancement of Ops as a domain, to the point that it is certainly possible to focus all product engineers on business logic and have a highly opinionated, well-informed set of engineers that focuses on the actual function of engineering and production tools as its domain. In the end, that is what it’s about. Ops are the engineers that focus on the function of engineering. The direct stakeholders of Ops engineering are the product engineers who rely on them to provide:

  1. the environment(s) for their code to be built, deployed, run, monitored, and maintained.
  2. code repository
  3. artifact repository
  4. build/deploy tool
  5. APM
  6. integration/escalation tools
  7. delivery and environment tools
  8. Dockerfile
  9. Jenkinsfile
  10. a centralized domain that can own the set of technologies that are common to all environments
  11. maintaining version and availability for the suite of tools that are foundational to the environment

As an engineer, I remember having to deal with most of those tasks on my own. A significant part of the development time for a task went to setting up the makefile for the build, figuring out what constituted service degradation, building custom checks for all the signals that were deemed important, and, of course, logging. We have all struggled with what to log, how much to log, etc. As you can imagine, those external functions really took away from how much I could focus on both the business logic and writing good code.

At some point, engineering managers started to become aware of the amount of effort each engineer was putting into this area, and teams were fitted with an engineer dedicated to this process. “Fitted” is a generous term. It was usually a role fulfilled by either the most junior engineer on the team, or the engineer that became so proficient in this area of expertise that it made sense to dedicate them to that rather than the product.

At one of my early jobs, we affectionately referred to Jake as our “buildmeister”. He did more than set up our build files; he actually did everything BUT write business logic code.

Build vs. Buy (and when) is often a function of the product's maturity.

At the earliest stages, most of the engineering dollars are spent building the product, and the CTO/engineering manager(s) will have a heavy hand in laying the foundation for the work.

At this stage, it’s critical that the engineers on the project are experienced with the entire SDLC so that Ops best practices are observed and taken into consideration as much as possible. I have seen things start off on the wrong foot here: no code repository, manual deployments, ZERO monitoring… the works.

When should the function be formalized? How big should the team be? These questions cross the minds of (good) CTOs all the time.

A word of caution: waiting for the house to burn down before thinking of this is the costly option, in terms of both money and the negative impact on your engineering team. Take a look at the metrics on your sprint boards, and you will quickly see if your engineering resources are balanced.
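As a rough illustration of that sprint-board check, here is a minimal Python sketch that measures how much of a sprint went to unplanned work. The `Ticket` shape, the ticket keys, and the sample numbers are all hypothetical; in practice you would export this data from whatever tracker you use, and what counts as "unplanned" is something your team has to agree on.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    """One completed sprint ticket (hypothetical shape, not a real tracker API)."""
    key: str
    points: int
    unplanned: bool  # True for fire drills, hotfixes, manual deploys, etc.


def unplanned_share(tickets: list[Ticket]) -> float:
    """Return the fraction of completed points that went to unplanned work."""
    total = sum(t.points for t in tickets)
    if total == 0:
        return 0.0
    unplanned = sum(t.points for t in tickets if t.unplanned)
    return unplanned / total


# Made-up sample sprint, for illustration only.
sprint = [
    Ticket("PROD-101", 5, unplanned=False),
    Ticket("PROD-102", 3, unplanned=False),
    Ticket("OPS-17", 8, unplanned=True),   # production fire drill
    Ticket("OPS-18", 4, unplanned=True),   # manual deployment fix
]

share = unplanned_share(sprint)
print(f"Unplanned work: {share:.0%} of completed points")
```

Tracking this number over several sprints is probably more telling than any single value: a share that keeps climbing suggests product engineers are absorbing work that a dedicated Ops function could own.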

When should we formalize Ops as a function?

If we are in the unique position of building something new:

Hire engineers who are passionate about stable, maintainable code and familiar with proper CI/CD tools, and focus on building your initial product. Spending money on Ops resources during inception and the initial product build is putting the cart before the horse. Taking care to create a culture of Ops in engineering will pay off in the form of a highly productive engineering team.

If you have a product and your sprint metrics look terrible because your engineers are constantly being dragged into fire drills:

You have an urgent need for experienced, highly opinionated Ops engineer(s) that can wrangle that mess into a standardized process. Here, you should probably talk to your engineers and try to source a lead for this effort from within. Someone familiar with the product and the codebase would be immensely valuable in establishing a normalized environment — even if the agreement is to bring someone else on board who can collaborate during the process and eventually take over, letting the product engineer return to their domain.

It’s ideal to have Ops in place as soon as your product is ready to launch.

If there are budgeting concerns, be sure to keep the culture of Ops best practices as a constant theme in all sprint planning, high-level design (HLD), and low-level design (LLD) sessions so that, at the very least, the environment is not littered with landmines and you are not exposed to unnecessary risks.

Addressing these items in all development meetings will help with:

  1. Defining service degradation signals by looking at the product from the end user’s point of view
  2. Deciding what to do with a signal once it’s triggered
  3. Asking how the product feature will be built and deployed
  4. Asking whether, if we lose it all, we will be able to deploy from source control
  5. Asking whether the environment relies on local configuration as part of the ecosystem and, if so, whether deploying that configuration is automated
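To make the first two items on that list concrete, here is a small Python sketch of what "a signal plus an agreed response" could look like once written down. Everything in it is an assumption for illustration: the `Signal` type, the metric names, the thresholds, and the actions are hypothetical and not taken from any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Signal:
    """A user-facing degradation signal and what to do when it fires."""
    name: str
    is_degraded: Callable[[dict], bool]  # evaluates a metrics snapshot
    action: str                          # agreed response, decided up front


# Thresholds below are illustrative assumptions, not recommendations.
SIGNALS = [
    Signal("checkout errors", lambda m: m["error_rate"] > 0.02, "page on-call"),
    Signal("slow page loads", lambda m: m["p95_latency_ms"] > 1500, "open ticket"),
]


def evaluate(metrics: dict) -> list[tuple[str, str]]:
    """Return (signal name, action) for every signal that is triggered."""
    return [(s.name, s.action) for s in SIGNALS if s.is_degraded(metrics)]


# Made-up snapshot: error rate is above its threshold, latency is not.
snapshot = {"error_rate": 0.05, "p95_latency_ms": 900}
for name, action in evaluate(snapshot):
    print(f"{name}: {action}")
```

The point of the sketch is the pairing, not the mechanics: each signal is phrased the way an end user would experience it, and each one already has a response attached before it ever fires.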

While it’s beneficial to have someone familiar with the product involved in the build-out of the Ops team, I realize that this is not always feasible.

However, having zero buy-in from the wider engineering organization will ensure the stagnation, if not outright failure, of the Ops team.

If you cannot provide an internal engineer to champion this effort, then you must ensure that whomever you bring into this role has high emotional intelligence and is able to collaborate with engineers of all levels to gain their respect and, ultimately, their support.

Once the function has been seeded by an engineer that is proficient enough in this space to start steering the environment in the right direction (and who will hopefully end up leading this area), the rest of the team should be sourced internally, as much as possible.

It’s actually much easier to hire a product engineer who is knowledgeable about your organization’s programming language of choice (and the general algorithms) than it is to hire an Ops engineer that can implement standards to achieve optimal stability.

Let me be clear and emphasize that while it is very easy to express this in a few paragraphs, actually doing it is considerably more challenging. I cannot stress enough the amount of investment that it will take to source from within the org. While having the necessary technical skills ticks the basic requirements, Ops engineering follows a specific thought process. Ops engineers need to be truly in tune with minutiae while being able to steer architecture toward best practices.

How to identify a candidate that can successfully transition into an Ops engineering role internally:

Do you like the candidate?

Seriously, likability is not easy to measure and it’s not always needed (in some roles it's actually bad to be likable)… but in a role where the expectation is that this person will collaborate with the wider engineering org and sometimes the commercial side, they need to be likable.

Likable does not mean a “yes” person or a pushover. Ops engineers are in the hot seat a lot, and being likable when not presenting a popular point of view will go a long way.

On this note, can they say “no” to bad ideas/requests and, at the same time, successfully present an “Ops eng compliant” solution as an alternative?

Is the candidate technical?

I mean this in the purest of forms. Ops engineering products and processes not only tend to change all the time (and fast), but they are also loaded with customizations specific to your business.

One day, we are deploying Docker containers via Jenkins, and another day, we will be deploying Kubernetes-native applications. One day it’s all about Nagios and Perl, and then before you know it, we are showing engineers how to import Prometheus libraries to instrument their code, and then deploying products using Helm charts. The underlying software can change fast. This cadence challenges Ops engineers’ ability to redesign things to accommodate those changes.

That said, the Ops engineering team needs to be technical enough to be able to walk away from things they have mastered and pick up new things that the firm will benefit from.

Another word of caution here: make sure you have a good leader in place that can support innovation and adoption of new things while preventing reckless implementation of the latest and greatest half-baked solution. There is a forest of “solutions” out there that are not ready for production, and you need to carefully discern your options.

This is not a place for the very nervous or the faint of heart.

There will be outages, and at some point, this person will be alone in working to recover the system from something bad. Confidence, curiosity, and an even temper, in the right balance, are the qualities to value here.

Has someone expressed an interest in the work being done in your Ops space?

There is a lot of hype out there in the Ops space, and people are developing an interest in cloud orchestration tools, Docker, monitoring, and so on. If someone in the org is showing interest, support them — do not overlook them. If you cannot move them, invite them to collaborate. At the very least, give them a peek into the role.

Jose Sierra is a Director of Engineering at Tapad.

--
