Median Duration of a Project Build

ElasticsearchBook is crafted by Jozef Sorocin.

To Nest or Not to Nest?

If you're coming from an RDB background, you may find it challenging to flatten your document structure to conform with the NoSQL principles that Elasticsearch enforces.
To decide which flattening strategy to choose, I'd recommend first reading this blog post that deals with managing document relations.
From my point of view, there are essentially two options to consolidate related documents:
I prefer nested fields above all else for these three reasons:
  1. One document = one single source of truth.
  2. The possibility of multi-level nesting that simulates an RDB hierarchy.
  3. The powerful reverse_nested aggregation, which allows for joining back to a parent document higher up in the nested structure. A real-world example can be found here on SO.
⚠️ Nested fields are not suitable for use cases where any of the nested attributes change frequently and need to be updated. Frequent updates can cause performance bottlenecks at scale.
Having said that, let's explore a typical DevOps scenario and take full advantage of what nested fields have to offer.

Use Case: Calculating the Median Duration of a Project Build

Let's say we're tracking build pipelines. A build pipeline has 1…n steps, any of which can take any number of milliseconds to run:
{
  "start_ms": "1611252094540",
  "steps": [
    {
      "duration_ms": 19381,
      "name": "build_backend_dmLCtSrXh7o",
      "tags": ["backend", "java"]
    },
    {
      "duration_ms": 2081,
      "name": "build_frontend_dmLCtSrSR5E",
      "tags": ["frontend", "reactjs"]
    }
  ]
}
We'd like to know
  • a frontend build's average duration
  • the median duration of the full builds
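Before involving Elasticsearch at all, it can help to pin down what these two numbers mean. The following is a minimal plain-Python sketch, not the chapter's solution. It assumes that a "full build" duration is the sum of its step durations (i.e. the steps run sequentially), and the second sample build as well as the helper names `frontend_avg_ms` and `median_build_ms` are invented for illustration.

```python
from statistics import mean, median

# Two sample build documents. The first mirrors the document above;
# the second is hypothetical and exists only to make the statistics non-trivial.
builds = [
    {
        "start_ms": "1611252094540",
        "steps": [
            {"duration_ms": 19381, "name": "build_backend_dmLCtSrXh7o", "tags": ["backend", "java"]},
            {"duration_ms": 2081, "name": "build_frontend_dmLCtSrSR5E", "tags": ["frontend", "reactjs"]},
        ],
    },
    {
        "start_ms": "1611252194540",  # hypothetical
        "steps": [
            {"duration_ms": 20000, "name": "build_backend_hypothetical", "tags": ["backend", "java"]},
            {"duration_ms": 3000, "name": "build_frontend_hypothetical", "tags": ["frontend", "reactjs"]},
        ],
    },
]


def frontend_avg_ms(docs):
    """Average duration of all steps tagged 'frontend', across all builds."""
    durations = [
        step["duration_ms"]
        for doc in docs
        for step in doc["steps"]
        if "frontend" in step["tags"]
    ]
    return mean(durations)


def median_build_ms(docs):
    """Median of per-build totals, assuming a build's duration is the sum of its steps."""
    totals = [sum(step["duration_ms"] for step in doc["steps"]) for doc in docs]
    return median(totals)


print(frontend_avg_ms(builds))
print(median_build_ms(builds))
```

Note that the first metric filters individual steps, while the second needs a per-document total before any statistic is taken. That distinction is what makes the structure of `steps` matter.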

Approach

⚠️ A common mistake that people make is how they structure the steps. Some consider the array-of-objects structure too verbose and repetitive, so they opt for an object of objects instead:
{
  ...
  "steps": {
    "build_backend_dmLCtSrXh7o": {
      "duration_ms": 19381,
      "tags": ["backend", "java"]
    },
    "build_frontend_dmLCtSrSR5E": {
      "duration_ms": 2081,
      "tags": ["frontend", "reactjs"]
    }
  }
}
The problem with this is that any of the steps can (and most probably will) be named differently. As we've learned here, when ES encounters new fields that have not previously been defined in the mapping, it'll create these mappings itself.
This will lead to a mapping size explosion and render seemingly trivial queries impossible to construct. You see, if we were to search for any tags that contain the keyword frontend:
POST builds/_search
{
  "query": {
    "term": {
      "steps.????.tags": {    <-- 
        "value": "frontend"
      }
    }
  }
}
which step name would we target? Wildcards don't work here. Scripts do, but there are better ways to go about this →
💡 Enter nested fields, i.e. arrays of key-value pairs whose keys are usually shared across all the array objects.
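To make the "shared keys" point concrete, here is a rough Python sketch that counts the distinct field paths each layout produces. The helper `field_paths` and the second sample build are hypothetical, and the path set is only an approximation of what a dynamic mapping would have to track. The request body at the end is a standard nested query; it assumes `steps` has been mapped with `"type": "nested"` and that `tags` holds exact keyword values.

```python
def field_paths(value, prefix=""):
    """Collect the dotted leaf-field paths found in a document (hypothetical helper)."""
    if isinstance(value, dict):
        paths = set()
        for key, child in value.items():
            paths |= field_paths(child, f"{prefix}.{key}" if prefix else key)
        return paths
    if isinstance(value, list):
        # Arrays do not add a path segment; their items share the parent's path.
        paths = set()
        for item in value:
            paths |= field_paths(item, prefix)
        return paths
    return {prefix}


# Object-of-objects layout: the step name becomes part of the field path.
object_of_objects = [
    {"steps": {
        "build_backend_dmLCtSrXh7o": {"duration_ms": 19381, "tags": ["backend", "java"]},
        "build_frontend_dmLCtSrSR5E": {"duration_ms": 2081, "tags": ["frontend", "reactjs"]},
    }},
    {"steps": {  # hypothetical second build with differently named steps
        "build_backend_hypothetical": {"duration_ms": 20000, "tags": ["backend", "java"]},
        "build_frontend_hypothetical": {"duration_ms": 3000, "tags": ["frontend", "reactjs"]},
    }},
]

# Array-of-objects layout: the step name is just another value.
array_of_objects = [
    {"steps": [{"name": name, **body} for name, body in doc["steps"].items()]}
    for doc in object_of_objects
]

object_paths = set().union(*(field_paths(doc) for doc in object_of_objects))
array_paths = set().union(*(field_paths(doc) for doc in array_of_objects))

print(len(object_paths))    # grows with every new step name
print(sorted(array_paths))  # stays fixed no matter how many builds arrive

# With a fixed path, the earlier search no longer needs a step name.
# Assumes `steps` is mapped as a nested field.
frontend_query = {
    "query": {
        "nested": {
            "path": "steps",
            "query": {"term": {"steps.tags": {"value": "frontend"}}},
        }
    }
}
```

In the object-of-objects layout every new step name adds two more paths, whereas the array layout always resolves to `steps.duration_ms`, `steps.name`, and `steps.tags`, so `steps.tags` becomes a stable target for queries and aggregations.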
