ES overview

Elasticsearch is a highly scalable, open-source search engine and database. It is suitable for situations that require fast analysis and discovery of information held in large datasets, for example:

  • Large-scale free text search, where documents that match combinations of search terms can be quickly located and retrieved.
  • Event logging, where information can arrive from a variety of sources. The data may need to be analyzed to ascertain how a chain of events has led to a specific conclusion.
  • Storing data captured from remote devices and other sources. The data could contain varying information, but a common requirement is to present this information in a series of dashboards to enable an operator to understand the state of the overall system. Applications can also use the information to make quick decisions about the flow of data and business operations that need to be performed as a result.
  • Stock control, where changes to inventory are recorded as goods are sold. Business systems can use this information to report stock levels to users, and reorder stock if the level of a product runs low. Analysts can examine the data for trends to help determine which products sell well under what circumstances.
  • Financial analysis, where market information arrives in near real-time. Dashboards can be generated that indicate the up-to-the-minute performance of various financial instruments which can then be used to help make buy/sell decisions.

According to elastic.co, usage breaks down approximately as follows:

  • Logging: 50%
  • Analytics: 25%
  • Web search: 25%

The common scenarios we're seeing mostly involve monitoring:

  • Exceptions
  • Exception counts
  • Metrics over time
  • Looking at trends
  • Monitoring response times
  • API response codes
  • Search and filter the logs down to figure out what happened
  • When you see some of those anomalies in Kibana, being able to dive in and do root-cause analysis

There are useful tools for capturing web server logs and application metrics. One customer writes it all into Cassandra via collection servers, which buffer and batch the data; Kinesis could be another ingestion pattern.

Elasticsearch implements a clustered architecture that uses sharding to distribute data across multiple nodes, and replication to provide high availability.

Documents are stored in indexes. The user can specify which fields in a document are used to uniquely identify it within an index, or the system can generate a key field and values automatically. The index is used to physically organize documents and is the principal means for locating them. Additionally, Elasticsearch automatically creates a set of additional structures acting as inverted indexes over the remaining fields to enable fast lookups and analytics within a collection.
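
As an illustration, the requests below index one document with an explicit key and let Elasticsearch generate the key for another. The index, type, and field names are hypothetical, and the type segment in the URL applies to older Elasticsearch releases:

```sh
# Index a document with an explicit identifier (PUT with an ID).
curl -X PUT "localhost:9200/orders/order/1" -H 'Content-Type: application/json' -d'
{
  "customer": "C-1001",
  "total": 59.90
}'

# Let Elasticsearch generate the identifier automatically (POST without an ID).
curl -X POST "localhost:9200/orders/order" -H 'Content-Type: application/json' -d'
{
  "customer": "C-1002",
  "total": 12.50
}'
```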

An index comprises a set of shards. Documents are evenly dispersed across shards by using a hashing mechanism based on the index key values and the number of shards in the index; once a document is allocated to a shard it will not move from that shard unless its index key is changed. Elasticsearch distributes shards across all available data nodes in a cluster; a single node may initially hold one or more shards that belong to the same index, but as new nodes are added to the cluster Elasticsearch relocates shards to ensure a balanced load across the system. The same rebalancing applies when nodes are removed.
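
Conceptually, the target shard is derived from the document's routing value (by default its ID), mirroring the documented default routing formula:

```
shard_number = hash(_routing) % number_of_primary_shards
```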

Indexes can be replicated. In this case each shard in the index is copied. Elasticsearch ensures that each original shard for an index (referred to as a “primary shard”) and its replica always reside on different nodes.

Note: The number of shards in an index cannot be easily changed once the index has been created, although replicas can be added.
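
For example, the following requests (the index name is hypothetical) create an index with two primary shards and two replicas per shard, and later change only the replica count:

```sh
# Create an index with a fixed number of primary shards and a replica count.
curl -X PUT "localhost:9200/orders" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 2
  }
}'

# The replica count can be changed on an existing index; the shard count cannot.
curl -X PUT "localhost:9200/orders/_settings" -H 'Content-Type: application/json' -d'
{
  "number_of_replicas": 1
}'
```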

When a document is added or modified, all write operations are performed on the primary shard first and then on each replica. By default, this process is performed synchronously to help ensure consistency. Elasticsearch uses optimistic concurrency with versioning when writing data. Read operations can be satisfied by either the primary shard or any of its replicas.
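
As a sketch of the versioning behavior (index, type, and ID are hypothetical): in older releases a write can be made conditional on the document version it expects to replace; newer releases (7.x onwards) express the same check with the if_seq_no and if_primary_term parameters instead.

```sh
# Fails with a version-conflict error if the current version is no longer 3,
# i.e. if another writer has modified the document in the meantime.
curl -X PUT "localhost:9200/orders/order/1?version=3" -H 'Content-Type: application/json' -d'
{
  "customer": "C-1001",
  "total": 64.90
}'
```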

Figure 1 shows the essential aspects of an Elasticsearch cluster comprising three nodes. An index has been created that consists of two primary shards with two replicas for each shard (six shards in all).

Figure 1. A simple Elasticsearch cluster comprising two primary shards and two sets of replicas

In this cluster, primary shard 1 and primary shard 2 are located on separate nodes to help balance the load across them. The replicas are similarly distributed. If a single node fails, the remaining nodes have sufficient information to enable the system to continue functioning; if necessary, Elasticsearch will promote a replica shard to become a primary shard if the corresponding primary shard is unavailable. When a node starts running it can either initiate a new cluster (if it is the first node in the cluster), or join an existing cluster. The cluster to which a node belongs is determined by the cluster.name setting in the elasticsearch.yml file.
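
For example, an elasticsearch.yml entry such as the following (the cluster name itself is an arbitrary example) places the node in the named cluster:

```yaml
# elasticsearch.yml
cluster.name: my-es-cluster
```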

Node Roles

The nodes in an Elasticsearch cluster can perform the following roles:

  • A data node, which can hold one or more shards that contain index data.
  • A client node, which does not hold index data but handles incoming requests from client applications, routing each request to the appropriate data node.
  • A master node, which does not hold index data but performs cluster management operations, such as maintaining and distributing routing information around the cluster (the list of which nodes contain which shards), determining which nodes are available, relocating shards as nodes appear and disappear, and coordinating recovery after node failure. Multiple nodes can be configured as masters, but only one will actually be elected to perform the master functions. If this node fails, another election takes place and one of the other eligible master nodes will be elected and take over.

By default, Elasticsearch nodes perform all three roles (to enable you to build a complete working cluster on a single machine for development and proof-of-concept purposes), but you can change their operations through the node.data and node.master settings in the elasticsearch.yml file, as follows:

Configuration for a master node
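
A minimal sketch using the node.master and node.data settings referred to above (legacy-style settings; later releases express roles through node.roles):

```yaml
# elasticsearch.yml for a dedicated master-eligible node:
# eligible to be elected master, holds no index data.
node.master: true
node.data: false
```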

Note: The elected master node is critical to the well-being of the cluster. The other nodes ping it regularly to ensure that it is still available. If the elected master node is also acting as a data node, there is a chance that the node can become busy and fail to respond to these pings. In this situation, the master is deemed to have failed and one of the other master nodes is elected in its place. If the original master is actually still available, the result could be a cluster with two elected masters, resulting in a "split brain" problem which can lead to data corruption and other issues. The document Configuring, Testing, and Analyzing Elasticsearch Resilience and Recovery describes how you should configure the cluster to reduce the chances of this occurring. However, ultimately it is a good strategy in a moderate to large cluster to use dedicated master nodes that take no responsibility for managing data.
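
For example, in versions that use Zen discovery (before 7.x), the usual safeguard is to require a quorum of master-eligible nodes before an election can succeed; the value below assumes three master-eligible nodes:

```yaml
# elasticsearch.yml - quorum = (master_eligible_nodes / 2) + 1
discovery.zen.minimum_master_nodes: 2
```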

The nodes in a cluster share information about the other nodes in the cluster (by gossiping) and which shards they contain. Client applications storing and fetching data can connect to any node in a cluster and requests will be transparently routed to the correct node. When a client application requests data from the cluster, the node that first receives the request is responsible for directing the operation, communicating with each relevant node to fetch the data and then aggregating the result before returning it to the client application. Using client nodes to handle requests frees data nodes from performing this scatter/gather work, and enables them to spend their time serving data. You can prevent client applications from accidentally communicating with data nodes (causing them to act as client nodes) by disabling the HTTP transport for the data nodes:
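
For example (this setting applies to releases prior to 7.x; later releases achieve the same separation with dedicated coordinating-only nodes):

```yaml
# elasticsearch.yml on each data node - disables the HTTP (REST) endpoint
# so that client applications cannot connect to it directly.
http.enabled: false
```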