(Cover image: the comms tower on top of Riverston, Sri Lanka)

Now that the series on ElasticSearch deployment management in K8s is complete, I thought of writing down some of the Index Management tasks I had to implement to reduce the manual work involved in cluster maintenance.

Following is the series of posts on ElasticSearch on K8s.

  1. ElasticSearch on K8s: 01 — Basic Design
  2. ElasticSearch on K8s: 02 — Log Collection with Filebeat
  3. ElasticSearch on K8s: 03 - Log Enrichment with Logstash
  4. ElasticSearch on K8s: 04 - Log Storage and Search with ElasticSearch
  5. ElasticSearch on K8s: 05 - Visualization and Production Readying
  6. ElasticSearch Index Management
  7. Authentication and Authorization for ElasticSearch: 01 - A Blueprint for Multi-tenant SSO
  8. Authentication and Authorization for ElasticSearch: 02 - Basic SSO with Role Assignment
  9. Authentication and Authorization for ElasticSearch: 03 - Multi-Tenancy with KeyCloak and Kibana

The following management steps are not mandatory for a cluster to be production-ready. However, having them in place greatly reduces some of the common headaches involved in managing an ELK stack.

Segregating Data into Indices

As discussed in the post related to Logstash, logs can be sent to ElasticSearch in a way that logs from different sources end up in different indices. In that particular example, logs generated in different K8s Namespaces were sent to different indices, which were further separated based on the date on which the logs are published (as opposed to the date on which the logs are generated, an important distinction to make when the logs are analyzed later).

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logstash-%{+YYYY.MM.dd}-%{[kubernetes][namespace]}"
  }
}

What happens here is simple. The index to which a particular log line is pushed is dynamically named using a field included in the event itself, the nested field [kubernetes][namespace]. This can be expanded further by separating indices based on more factors, such as the application from which the logs are generated (ex: using [kubernetes][labels][app] perhaps) or any other custom field extracted from the log line itself (ex: requests made from within the cluster, determined by the source IP address present on an Nginx Access Log).
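
A rough sketch of such an expanded output block could look like the following, assuming the Kubernetes metadata in the event exposes the application label as [kubernetes][labels][app] (the exact field names depend on how the events are enriched in the pipeline).

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    # separate indices by publish date, Namespace, and application label
    index => "logstash-%{+YYYY.MM.dd}-%{[kubernetes][namespace]}-%{[kubernetes][labels][app]}"
  }
}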

However, this kind of separation could easily get out of hand, with too many indices (and therefore shards) being created on the ElasticSearch cluster, each holding too little data. This consumes more resources just to keep the shards running. Therefore, a healthy balance between sensible data separation and a manageable shard count should be targeted.
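
A quick way to keep an eye on how many indices and shards are accumulating is the _cat API. The index pattern used here is just an illustration.

# list matching indices with their shard counts, document counts and size on disk
curl "http://elasticsearch:9200/_cat/indices/logstash-*?v&s=index"

# list the individual shards and the nodes they are allocated on
curl "http://elasticsearch:9200/_cat/shards/logstash-*?v"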

A generally useful separation is to combine the date-based separation (which puts logs from each day in a separate index) with one more factor (ex: Namespace, Application, etc). This pattern usually makes sure that (given a stable cluster) data is separated out into evenly sized indices.

The date-based separation is going to help in another way too.

Automating Data Retention Policies

As mentioned above, a data segregation strategy would likely result in a larger number of indices than in a cluster that doesn't separate data based on one or more factors. Over a period of time, the use of these indices goes down, mainly because the majority of search queries are made against a small window of time, such as the past 15 minutes, the past 24 hours, or at most the past couple of days. Queries made for data older than a few weeks are rare, and beyond that the frequency of lookups goes down almost exponentially.

Similarly, the number of writes on indices (if they are separated by date, daily) would also go down as the particular index becomes older. In fact, writing data into an index older than a day is rare and would only happen under abnormal circumstances (ex: if some kind of a dead letter queue is implemented and causes failed events to be retried days after the original publishing).

With these behaviors in mind, we can start implementing a data retention policy that defines the time period within which data will be available for querying. Any indices that fall out of this window can be safely deleted. Deleting these "unwanted" indices frees up the memory that would otherwise be committed to keeping their shards running, and it also reduces the number of query threads spawned for a query with an unsafe index pattern.

We can decide on a number of days for an index to be active. Once an index passes this age, it should be deleted. ElasticSearch has a feature named Index Lifecycle Management (ILM) that makes it easier to write down policies like these and have them enforced automatically.

For example, we can define an ILM policy to delete any matching index older than 30 days.

curl -XPUT -H "Content-Type: application/json" "http://elasticsearch:9200/_ilm/policy/delete_after_30d" -d'
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
'

Now the policy is defined; however, it's not enforced on any index. To do that, the newly created policy has to be associated with an Index Template. An Index Template is a collection of settings applied to an index when it's created. Which templates are applied to a newly created index depends on the name of the index and the index_patterns specified in the templates. This means zero or more templates could potentially be applied to a single index.

For example, if we want the above policy to be applied to any index whose name matches logstash-*, we can define the following index template.

curl -XPUT -H "Content-Type: application/json" "http://elasticsearch:9200/_template/logstash_ilm" -d'
{
  "order": 10,
  "index_patterns": [
    "logstash-*",
  ],
  "settings": {
    "index": {
      "number_of_shards": "1",
      "refresh_interval": "5s",
      "lifecycle": {
        "name": "delete_after_30d"
      }
    }
  }
}
'

The ILM policy defined above will be applied to any index created after the index template is in place, with a name that starts with logstash-.

If an ILM policy is changed later (ex: changing the number of days before the index is deleted), the changes are applied immediately to all the indices the particular ILM policy is attached to.

For an index that an ILM policy is applied to, the current phase and the time at which the index will move into the next phase can be checked from the index metadata.
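
For example, the ILM explain API returns the current phase and the related lifecycle timings for an index (the index name below is only illustrative).

# check the lifecycle state of a specific index
curl "http://elasticsearch:9200/logstash-2021.01.01-default/_ilm/explain?pretty"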

As mentioned above, separating data into daily indices can be useful in index management policies. If such a separation is not made, efficiently deleting outdated data becomes complex, as parts of an index will contain outdated data while other parts may still hold active and relevant data.

Explicit vs Dynamic Mapping

ElasticSearch is a database for unstructured data. When a log is fed to ElasticSearch, it tries to decode the fields in the record and guess the data type for each field. This is called Dynamic Mapping, and how the data type of each field is determined is specified in the ElasticSearch documentation. This is mostly useful during the early stages of a project, where multiple types of data could be stored in ElasticSearch and their structure changes frequently.

However, this dynamic mapping could sometimes result in unexpected data types being guessed for certain fields. Fields that are intended to be full-text searchable may end up being marked as Keyword type fields. Or, with slight changes to the incoming data, the same field on multiple indices could end up being marked as either integer or text (which could interfere with queries). It's also possible to cause some level of instability in ElasticSearch with carefully designed data (ex: a high number of unique fields causing what's called a mapping explosion).
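
The data types that dynamic mapping has guessed for an index can be inspected with the mapping API, which is a quick way to spot such surprises (the index name below is only illustrative).

# inspect the field data types guessed for an index
curl "http://elasticsearch:9200/logstash-2021.01.01-default/_mapping?pretty"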

If a common schema for the incoming data is known, that schema can be applied to a given index pattern. These could be defined as explicit Mappings. This makes sure no guesswork is involved for a given field and the potentially unstable behavior of dynamic mapping is avoided.
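
For example, explicit mappings can be attached to an index template similar to the one defined earlier. The following is only a sketch; the field names and types (message as full text, the Namespace as a keyword) are assumptions that should be adjusted to the actual schema, and the exact structure of the mappings section can differ between ElasticSearch versions.

curl -XPUT -H "Content-Type: application/json" "http://elasticsearch:9200/_template/logstash_mappings" -d'
{
  "order": 20,
  "index_patterns": [
    "logstash-*"
  ],
  "mappings": {
    "properties": {
      "message": {
        "type": "text"
      },
      "kubernetes": {
        "properties": {
          "namespace": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
'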

The common schema may not be visible at first. However, as the deployment becomes more stable, it will start to become clearer. Adding explicit mappings also allows changes to the system to be handled more carefully.

Managing the Query Load

When a query is issued on the ElasticSearch API, the index to search is usually defined. On Kibana, the indices to search are defined by an Index Pattern. An index pattern is a string pattern that index names should match. Each query made on Kibana is executed against every index whose name matches the given index pattern.

With this tool, it's easy to define patterns like logstash* which (usually) includes all the indices created by Logstash. However, there's a cost associated with each query, and that cost grows with the number of indices the selected index pattern matches.

For each matching index, a query spawns separate threads. These threads execute the query on the indices and return the results to be aggregated. Therefore, for an index pattern like logstash* there could be dozens of threads spawned for each query. If the time window applied to the query is somewhat large (ex: a few days), such a query could result in a massive CPU and memory spike on the ElasticSearch cluster.

Since most queries will be for a small time window (at most extending to a few hours), the indices of importance will usually be only the ones created on the same day (or, in some cases, the day before). Therefore, by creating index patterns that restrict the number of indices matched for a certain query, we can reduce the load on the system drastically.
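
The same effect can be seen on the ElasticSearch API itself. Querying a date-restricted pattern touches far fewer indices than querying logstash-* (a rough illustration, with the count API standing in for a real search and the date being only an example).

# touches only the indices created on the given day
curl "http://elasticsearch:9200/logstash-2021.01.01-*/_count"

# touches every logstash index in the cluster
curl "http://elasticsearch:9200/logstash-*/_count"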

Automated Index Pattern Creation

However, creating index patterns for each day could soon become a maintenance overhead on the cluster. This can easily be automated using the ElasticSearch and Kibana APIs.

# creating an index pattern for the current day
# (the pattern follows the logstash-<date>-<namespace> naming used above)
index_pattern_id=$(cat /proc/sys/kernel/random/uuid)
today=$(date +"%Y.%m.%d")
curl -XPOST -H "Content-Type: application/json" -H "kbn-xsrf: kibana" "http://kibana:5601/api/saved_objects/index-pattern/${index_pattern_id}" -d'
{
  "attributes": {
    "title": "logstash-'"${today}"'-*",
    "timeFieldName": "@timestamp"
  }
}
'

# marking the created index pattern as the default one
curl -XPOST -H "Content-Type: application/json" -H "kbn-xsrf: kibana" "http://kibana:5601/api/kibana/settings" -d'
{
  "changes": {
    "defaultIndex": "'"${index_pattern_id}"'"
  }
}
'

The above two API calls create an index pattern and mark it as the default one.

Marking an index pattern as default saves the user from an astonishing level of frustration when it comes to querying on Kibana.

These two API calls can be configured to run once each day, at the start of the day (ex: using a K8s CronJob, as sketched below). That should take care of the mundane task of creating efficient index patterns.
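
A minimal sketch of such a CronJob could look like the following, assuming the two API calls above are packaged into a script named create-index-pattern.sh and mounted from a ConfigMap; the image, ConfigMap, and script names here are hypothetical.

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: create-daily-index-pattern
spec:
  # run a few minutes past midnight, every day
  schedule: "5 0 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: create-index-pattern
              # any small image with curl available would do
              image: curlimages/curl:7.72.0
              command: ["/bin/sh", "/scripts/create-index-pattern.sh"]
              volumeMounts:
                - name: scripts
                  mountPath: /scripts
          volumes:
            - name: scripts
              configMap:
                # hypothetical ConfigMap holding create-index-pattern.sh
                name: index-pattern-scripts
          restartPolicy: OnFailure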