OpenStack Kilo Design Summit outcomes

This is a summary of the discussions, design decisions, goals, and direction that came out of the OpenStack Kilo Design Summit in Paris (fall 2014). Unlike my previous design summit adventures, which were primarily focused on Keystone (I'll leave that to Morgan Fainberg to cover), I'm making an attempt to branch out this cycle, with an increased focus on security, scale, and stability across the broader community.

Consider this to be a sequel to my similar coverage of the Juno summit.

International Bureau of Weights and Measures

This is the International Bureau of Weights and Measures (Bureau international des poids et mesures, abbreviated "BIPM") just outside Paris, France, in a town called Sèvres. Inside the lower basement, an environmentally monitored vault (which requires three independently controlled keys to open, mind you) houses a cylindrical object of platinum and iridium called the international prototype kilogram. This object is literally the modern definition of the unit of mass in the metric system, and it serves as the namesake of OpenStack's spring 2015 release, OpenStack 2015.1 Kilo.

Pain points for operators

The OpenStack User Survey from November 2014 revealed a laundry list of specific gripes from operators. But rather than focusing on any particular issue in depth, we discussed the lines of commonality among the pain points, and what we can do as a community to better communicate these issues on an ongoing basis.

It quickly became clear that the core issue in the community is the feedback loop between operators and developers. Developers want operators to more reliably vocalize their pain points as bug reports, so that they can be tracked, prioritized, and worked, but developers also need to be better about triaging and discussing those issues as they are reported.

As a result, operators currently feel like bug reports are not the best way to communicate problems and thus look for other means of raising alarms.

Maybe we're all just gravely allergic to Launchpad? Perhaps the arrival of Storyboard will give operators and developers a sufficiently improved issue tracking experience such that we can tighten the feedback loop considerably, and at the end of the day, build better software.

Logging

While many specific examples of painful logging were cited at the summit, it was clear that the common issue was simply logging consistency, both within and across services.

Consistency breaks down into two fundamental issues: log format and log levels. The same type of log message might be logged with two different formats by two different services, or logged to both INFO and DEBUG, even within the same service. This makes grepping across logs painful for everyone, and we can all agree that logs are useless if you can't reliably access the information they contain.

Several solutions to the consistency issue were proposed, including obvious ones like ensuring developers and reviewers are well versed in written, cross-project logging guidelines. This is one example which I'd posit starts to break down when development occurs at OpenStack's scale.

More interestingly (cue Oslo's theme song), Doug Hellmann suggested that the key to achieving true consistency is simply to remove the developer from the decision of how to use various log levels and from custom authoring the message format. I believe the comment was made in the context of conventional request ID logging, but it's easy to imagine extrapolating that notion much further.
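To make that idea a bit more concrete, here is a minimal sketch of what "removing the developer from the decision" might look like, using only Python's standard logging module. The event names, the EVENTS table, and the log_event helper are all hypothetical and invented for illustration; this is not an Oslo API.

```python
import logging

# Hypothetical event catalogue: each event name maps to one fixed level
# and one fixed message template, so callers choose neither.
EVENTS = {
    "request.start": (logging.INFO, "[%(request_id)s] request started: %(detail)s"),
    "request.end": (logging.INFO, "[%(request_id)s] request finished: %(detail)s"),
    "request.trace": (logging.DEBUG, "[%(request_id)s] trace: %(detail)s"),
}


def log_event(logger, event, request_id, detail=""):
    """Log a known event at its predefined level and in its predefined format."""
    level, template = EVENTS[event]  # unknown event names raise KeyError
    message = template % {"request_id": request_id, "detail": detail}
    logger.log(level, message)
    # Returning the pair makes the helper easy to check in tests.
    return level, message
```

A service would then call something like log_event(LOG, "request.start", request_id, "GET /v3/users") and never pick a level or write a format string by hand, which is the property that makes grepping across services predictable.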

High availability

Achieving high availability for most services should be a fairly traditional exercise. Ensure your database and messaging systems are relatively HA, and you're halfway there. The exceptions are interesting, however.

For example, how do you deploy Keystone across multiple regions? If you want each region to appear seamless, then today's answer is to replicate tokens across those regions. PKI promises an improvement, but has yet to deliver a truly non-persistent solution. Until then, UUID tokens are far smaller to replicate if you really want to, and have otherwise equivalent scaling characteristics.

The immediate challenge, however, is establishing a highly available reference architecture to test against. Without that, I think it's tough to argue that HA is a community supported feature of OpenStack.

API Working Group

The OpenStack API Working Group had its first real assembly at the summit, in the form of three time slots: a double session to establish process and procedures and a workshop session to test the waters as a group (and hopefully accomplish something productive).

The API Working Group's purpose is to propose, discuss, review, and advocate for API guidelines for all OpenStack Programs to follow.

On the topic of procedures, there was a lot of support for asking authors of API-impacting changes to use an ApiImpact tag, which can be queried in gerrit across projects (note that gerrit searches are case sensitive, and there are two tags floating around: the documented "APIImpact" and the actually-utilized-in-the-real-world "ApiImpact"). This will allow the API Working Group to find and focus on relevant changes, regardless of which project they occur in.
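Since both spellings of the tag are in circulation, any cross-project search has to cover both. Here is a small sketch that builds such a query string, assuming gerrit's message: and status: search operators and OR grouping; the build_query helper and its defaults are hypothetical.

```python
# Both spellings mentioned above; the search is treated as case sensitive,
# so each variant gets its own clause.
TAG_VARIANTS = ("APIImpact", "ApiImpact")


def build_query(tags=TAG_VARIANTS, status="open"):
    """Build a gerrit search string matching any of the given commit message tags."""
    tag_clause = " OR ".join("message:%s" % tag for tag in tags)
    return "status:%s (%s)" % (status, tag_clause)


if __name__ == "__main__":
    print(build_query())
```

The resulting string can be pasted into the gerrit search box; leaving out a project: clause is what makes the search span every project.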

In the workshop session, our primary goal was to produce our first agreeable guideline. This gave us a chance to experiment with the documentation structure, developer usability, technical phrasing, and peer review process.

In the end, it took us a couple of hours to produce a single, three-line guideline on the usage of HTTP 201 Created, which almost immediately received a legitimate -1 from someone outside the room (thanks, Ian) and has continued to spur a healthy debate.

We also agreed to be incredibly careful about which guidelines are ultimately merged, noting that we will likely never achieve complete unanimity on even the most trivial issues. A majority voting process is the obvious solution, but I'm tempted to suggest a Condorcet voting method to establish a small team of core reviewers instead.
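For anyone unfamiliar with Condorcet methods, the core idea is that the winner is the candidate who beats every other candidate in head-to-head comparisons across ranked ballots. The sketch below finds only that single winner, if one exists; it makes no attempt at cycle resolution or at electing a multi-person team, and the ballot format and function names are my own assumptions.

```python
from itertools import combinations


def pairwise_wins(ballots):
    """Count, for each ordered pair (a, b), the ballots that rank a above b."""
    candidates = sorted({name for ballot in ballots for name in ballot})
    wins = {(a, b): 0 for a in candidates for b in candidates if a != b}
    for ballot in ballots:
        position = {name: index for index, name in enumerate(ballot)}
        for a, b in combinations(candidates, 2):
            # Candidates missing from a ballot are treated as ranked last.
            rank_a = position.get(a, len(ballot))
            rank_b = position.get(b, len(ballot))
            if rank_a < rank_b:
                wins[(a, b)] += 1
            elif rank_b < rank_a:
                wins[(b, a)] += 1
    return candidates, wins


def condorcet_winner(ballots):
    """Return the candidate who wins every head-to-head contest, or None."""
    candidates, wins = pairwise_wins(ballots)
    for a in candidates:
        if all(wins[(a, b)] > wins[(b, a)] for b in candidates if b != a):
            return a
    return None  # no candidates, or a preference cycle
```

Each ballot is a list of names ordered from most to least preferred. A real election would need a tie-breaking rule for cycles, which is where the various Condorcet methods differ.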

Stable branches

Unfortunate history has taught us that no one steps up to maintain stable branches as their full-time responsibility, which results in frequently broken, unmaintained "stable" codebases. For example, as much as we'd like to guarantee stability, we often end up being broken by upstream dependencies, and even if someone bothers to produce a fix, there aren't enough active stable maintainers interested in reviewing it.

Part of the problem is wrangling up accountable individuals on a per-project basis; otherwise it's just a game of herding cats. Enter stable branch liaisons (volunteer today! just add yourself to the wiki page). The liaison concept has worked well for Oslo, with liaisons essentially acting as specialized maintainers and a single point of contact for relevant issues within a project. And critically, it scales to N projects.

Stable propaganda

Also of note, we took the opportunity to discuss the eligibility rules for patches proposed against stable branches, specifically around performance improvements and new configuration options. I would personally love to see pure performance improvements land in stable branches, as long as they present no additional risk and their performance advantage can be clearly justified. There was no vocal opposition to that from the stable branch consumers represented. Configuration options are still a touchy issue, as they smell a bit too much like features.

Vulnerability management

Continuing with the release management theme, the vulnerability management session established a rating taxonomy for incidents (if this is better documented somewhere else, please let me know so that I can link to it instead):

  • Class A: A vulnerability in OpenStack-supported code
    • master: fix
    • stable/*: fix backported
    • Security advisory (OSSA)
  • Class B1: A vulnerability based on configuration defaults or documentation
    • master: fix
    • stable/*: no backportable fix is possible, but an OSSN can recommend a workaround
    • Security note (OSSN) for previous versions
  • Class B2: A vulnerability based on architecture or design
    • master: featureful fix
    • stable/*: no fix is possible that fits stable maintenance criteria
    • Security note (OSSN) for all versions
  • Class C1: Not considered a practical vulnerability (but some people might assign a CVE for it)
    • Security note (OSSN) possible
  • Class C2: A vulnerability, but not in OpenStack-supported code
    • Security note (OSSN) possible
  • Class D: Not a vulnerability (just a bug with security implications)
    • Security note (OSSN) possible
  • Class E: Not a vulnerability at all
  • Class Y: Vulnerability only found in development release
    • Security advisory (OSSA) possible
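If you wanted to reference the taxonomy from triage tooling, it could be transcribed as a simple lookup table. This is only a sketch of the list above as I noted it; the field names ("master", "stable", "notice") and the handling helper are hypothetical and not part of any official process.

```python
# Transcription of the taxonomy listed above. The field names are
# hypothetical labels chosen for this sketch.
TAXONOMY = {
    "A": {"master": "fix", "stable": "fix backported", "notice": "OSSA"},
    "B1": {"master": "fix", "stable": "workaround via OSSN",
           "notice": "OSSN for previous versions"},
    "B2": {"master": "featureful fix", "stable": "no fix",
           "notice": "OSSN for all versions"},
    "C1": {"notice": "OSSN possible"},
    "C2": {"notice": "OSSN possible"},
    "D": {"notice": "OSSN possible"},
    "E": {},
    "Y": {"notice": "OSSA possible"},
}


def handling(vuln_class):
    """Return the master/stable/notice handling for a class, with None for gaps."""
    entry = TAXONOMY[vuln_class.upper()]
    return {field: entry.get(field) for field in ("master", "stable", "notice")}
```

For example, handling("B2") reports a featureful fix on master, no fix for stable branches, and an OSSN for all versions.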

There is also a review to (finally) introduce a private instance of gerrit for reviewing security vulnerability patches. No more reviewing .patch files in Launchpad!

Hierarchical multitenancy

Hierarchical multitenancy was once more a hot topic in the Keystone community, and the concept has the potential to impact every other OpenStack service. The topic deserves its own discussion here.