DevSecOps Course Labs

Running Elasticsearch

Elasticsearch is a document database - it uses JSON as the data format. Documents are stored in collections called indices, and the data schema is flexible so not all documents in an index need to have the same fields.

You can use Elasticsearch as a generic data store, but it's particularly well suited to storing logs because it gives you lots of advanced querying features.

Reference

Running Elasticsearch

Elasticsearch is a Java application. The licensing model is a bit involved, but versions up to and including 7.10 are published under the open-source Apache 2.0 licence.

compose.yml sets up Elasticsearch to run in a container, publishing port 9200, which is the default port for the HTTP API.

Start the container:

docker-compose -f labs/elasticsearch/compose.yml up -d

Check the logs and you'll see Elasticsearch starting up:

docker logs courselabs_elasticsearch_1 

These are semi-structured logs.

We'll use curl to make HTTP requests - if you're using Windows, run this script to use the correct curl version:

# only for Windows - enable scripts:
Set-ExecutionPolicy -ExecutionPolicy Unrestricted -Scope Process

# then run:
. ./scripts/windows-tools.ps1

Now make a simple call to the Elasticsearch API:

curl localhost:9200
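Elasticsearch can take a little while to start, so the first call may fail with a connection error. Here is a minimal sketch of a retry loop, assuming a POSIX-style shell such as bash. `wait_for` is a helper name made up for this example; it is not part of the lab scripts.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_for <attempts> <delay_seconds> <command...>
# (wait_for is a hypothetical helper, not something the lab provides)
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  while [ "$attempts" -gt 0 ]; do
    # Run the command quietly; stop as soon as it succeeds
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  return 1
}

# Example (needs the lab container running):
#   wait_for 30 2 curl -fsS localhost:9200 && echo "Elasticsearch is up"
```

The `-f` flag makes curl return a non-zero exit code on HTTP errors, so the loop keeps waiting until the API answers successfully.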

📋 There's some basic info in the API response. What version are we running, and what is the cluster name?

Need some help?

The API response looks like this:

{
  "name" : "68d8e3d046c4",
  "cluster_name" : "elkstack",
  "cluster_uuid" : "9yypBMAjRNC0hjMkr-FrEw",
  "version" : {
    "number" : "7.10.2",
    "build_flavor" : "oss",
    "build_type" : "tar",
    "build_hash" : "747e1cc71def077253878a59143c1f785afa92b9",
    "build_date" : "2021-01-13T00:42:12.435326Z",
    "build_snapshot" : false,
    "lucene_version" : "8.7.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

The version number and cluster name are set in the Docker image - 7.10.2 and elkstack.


The Elasticsearch API has a full feature set for administering the cluster and for working with documents.

Indexing documents

Indexing is how you store data in Elasticsearch. There are client libraries for all the major languages, so you can integrate Elasticsearch with your application.

We'll use the REST API in these exercises. Start by inserting a document into a new index, using an HTTP POST request:

curl -H 'Content-Type: application/json' -XPOST 'localhost:9200/logs/_doc' --data-binary "@labs/elasticsearch/data/fulfilment-requested.json"

The output includes an ID you can use to retrieve the document.

📋 What is the name of the index and the document ID?

Need some help?

The API response looks like this:

{
   "_index":"logs",
   "_type":"_doc",
   "_id":"ZODwunoBUFcX3q_Yl3rW",
   "_version":1,
   "result":"created",
   "_shards":{
      "total":2,
      "successful":1,
      "failed":0
   },
   "_seq_no":0,
   "_primary_term":1
}

The index name is logs - you don't need to create indices in advance, because Elasticsearch creates them when you first add documents.

The document ID is generated by Elasticsearch because we didn't specify an ID in the request. In this example it's ZODwunoBUFcX3q_Yl3rW.
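If you are scripting these steps, you can capture the generated ID instead of copying it by hand. This is a rough sketch using sed, assuming a POSIX-style shell. `json_field` is a hypothetical helper that matches text rather than parsing JSON, so it only suits simple string fields like `_id`.

```shell
# Print the value of a string field with the given name from JSON on stdin.
# json_field is a made-up helper for this sketch; it is a text match, not a JSON parser.
json_field() {
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Example (needs the lab container running):
#   id=$(curl -s -H 'Content-Type: application/json' -XPOST 'localhost:9200/logs/_doc' \
#     --data-binary "@labs/elasticsearch/data/fulfilment-requested.json" | json_field _id)
#   echo "$id"
```

Run the example in place of the earlier POST rather than as well as it, otherwise the index ends up with a duplicate document.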


You can fetch the document back using an HTTP GET request - you'll need to set your own document ID in the URL:

curl 'localhost:9200/logs/_doc/<_id>?pretty'

The ?pretty query parameter formats the response to make it easier to read.

This is structured data - the log level, timestamp and message are all stored in separate fields.

📋 Add more logs by indexing two more documents, from the files in labs/elasticsearch/data/fulfilment-completed.json and labs/elasticsearch/data/fulfilment-errored.json.

Need some help?

The POST requests are the same, only the path to the source file is different:

curl -H 'Content-Type: application/json' -XPOST 'localhost:9200/logs/_doc' --data-binary "@labs/elasticsearch/data/fulfilment-completed.json"

curl -H 'Content-Type: application/json' -XPOST 'localhost:9200/logs/_doc' --data-binary "@labs/elasticsearch/data/fulfilment-errored.json"

You can use a CAT (compact and aligned text) API to check the index has all the documents:

curl 'localhost:9200/_cat/indices?v=true'

It can take a few minutes for the status to update, but you should see the docs.count column with the value 3 for the logs index.
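To read the count in a script, you can pick the column out by its header name instead of by position. This is a small sketch using awk; `docs_count` is a name invented for this example, and it assumes the output includes the header row that `v=true` adds.

```shell
# Print the docs.count value for one index from `_cat/indices?v=true` output on stdin.
# docs_count is a hypothetical helper; it finds the columns from the header row.
docs_count() {
  awk -v idx="$1" '
    NR == 1 {
      # Remember which columns hold the index name and the document count
      for (i = 1; i <= NF; i++) {
        if ($i == "index") ic = i
        if ($i == "docs.count") dc = i
      }
      next
    }
    $ic == idx { print $dc }
  '
}

# Example (needs the lab container running):
#   curl -s 'localhost:9200/_cat/indices?v=true' | docs_count logs
```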

Now we have some data we can search.

Basic searching

The simplest search in Elasticsearch is to call the _search endpoint on the index API, passing a search term in the querystring for the URL.

Search for the word debug in any document in the logs index:

curl 'localhost:9200/logs/_search?q=debug'

Finds a single document, with the log level of DEBUG - the search is not case-sensitive.

This basic search looks in all the fields in all the documents. The response includes a score for each document, which is a calculation of how good a match it is for the search term.
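To see just the scores, you can pull the _score values out of the response. This is a rough sketch using grep and sed, which treats the response as text rather than parsing the JSON; `scores` is a made-up helper name.

```shell
# Print each hit's _score from a search response on stdin, one per line.
# scores is a hypothetical helper; the pattern skips "max_score" because
# it requires a quote directly before _score.
scores() {
  grep -o '"_score" *: *[0-9.]*' | sed 's/.*: *//'
}

# Example (needs the lab container running):
#   curl -s 'localhost:9200/logs/_search?q=debug' | scores
```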

One more simple query: adding a minus (-) before the search term finds any documents which don't contain the term:

curl 'localhost:9200/logs/_search?q=-debug&pretty'

Finds the other two documents, as they don't have the word debug in any field.

You can write more complex query expressions in JSON, using the Elasticsearch Query DSL. The labs/elasticsearch/queries folder has some examples using match queries.

Using the Query DSL with structured data lets you be more precise. If you're looking for debug logs you can search using the log level field, so you won't accidentally include documents which have the word "debug" in another field.

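You send JSON queries in the body of a request to the search API. The API accepts both GET and POST, and curl sends a POST when you pass --data-binary. The curl command takes this format:

curl -H 'Content-Type: application/json' 'localhost:9200/<index_name>/_search?pretty=true' --data-binary "@<query_file_path>"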
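For reference, a match query body has a simple general shape. The sketch below prints query bodies from shell functions. The function names and the field names `level` and `requestId` are assumptions for illustration, so check the documents you fetched earlier for the real field names.

```shell
# Print a Query DSL body with a single match clause.
# match_query and match_both are hypothetical helpers; values are not JSON-escaped.
match_query() {
  printf '{ "query": { "match": { "%s": "%s" } } }\n' "$1" "$2"
}

# Print a body where two match clauses must both hold (bool query with must).
match_both() {
  printf '{ "query": { "bool": { "must": [ { "match": { "%s": "%s" } }, { "match": { "%s": "%s" } } ] } } }\n' \
    "$1" "$2" "$3" "$4"
}

# Example (needs the lab container running; field names are assumed):
#   match_query level debug | curl -H 'Content-Type: application/json' \
#     'localhost:9200/logs/_search?pretty=true' --data-binary @-
```

Passing `@-` to --data-binary makes curl read the request body from standard input, so you can try a query without saving it to a file first.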

📋 Run some queries to find all logs about document request ID 21304897, and then just the info logs for that request ID.

Need some help?

You can use the JSON files in the queries folder.

To find all logs for that ID:

curl -H 'Content-Type: application/json' 'localhost:9200/logs/_search?pretty=true' --data-binary '@labs/elasticsearch/queries/match-id.json'

Returns two matches, one info log and one debug.

To find just the info logs for that ID:

curl -H 'Content-Type: application/json' 'localhost:9200/logs/_search?pretty=true' --data-binary '@labs/elasticsearch/queries/match-id-level.json'

Returns a single match.


There are lots of search features in Elasticsearch, so the Query DSL is quite complex. We've just had an introduction here, but it's a topic to return to if you'll be using Elasticsearch a lot.

Lab

Time for some practice of the index and search APIs. You've loaded individual documents into an index, but it's much quicker to bulk load them.

Start by bulk indexing all the documents in the file data/logs.json (note the data directory is in the root of the repo folder) - you'll need to use a different Document API for that.

Now write some match queries to find:

Stuck? Try hints or check the solution.


Cleanup

Clean up by removing all containers:

docker rm -f $(docker ps -aq)