A simple understanding of Elasticsearch

Anubala Balasaravanakumar
4 min read · Jul 15, 2018

Elasticsearch is, at its core, a search engine. It is developed in Java and built on top of Apache Lucene.

It is open source and can be downloaded from https://www.elastic.co/downloads/elasticsearch.

The data stored in Elasticsearch lives in an index, which is roughly what a database is in SQL terms. The entries in an index are called documents, and they are basically JSON objects. By default, the data is stored in the ‘data’ folder in the installation directory.
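As an illustration, a document in a hypothetical customers index is just a JSON object (the index name and fields here are made up for the example):

```yaml
{
  "name": "Jane Doe",
  "city": "Chennai",
  "signup_date": "2018-07-15"
}
```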

All data is stored in an index. An index is divided into multiple shards, which may live on the same node or on different nodes. A collection of nodes is called a cluster. The nodes in a cluster are connected by specifying their IP addresses in the elasticsearch.yml configuration file found in the config directory.

A cluster contains the following kinds of nodes: master nodes, data nodes, coordinating nodes and ingest nodes. The master node controls the cluster: it takes care of index creation and deletion, shard allocation, tracking the nodes in the cluster, and so on. Data nodes store the data. Coordinating nodes handle search requests. Ingest nodes pre-process documents in order to enrich them before the actual indexing happens.

By default, a node is master eligible, a data node and an ingest node all at once. To make a node master eligible, set node.master: true. To make a dedicated data node, set node.data: true and the other two roles to false. To make a dedicated ingest node, set node.ingest: true and the other two roles to false. To make a dedicated coordinating node, set all three roles to false.
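For example, a dedicated data node could be sketched in elasticsearch.yml like this (using the node.* role settings discussed above, which apply to the Elasticsearch versions this article covers):

```yaml
# Dedicated data node: stores data, never master eligible, no ingest pipelines
node.master: false
node.data: true
node.ingest: false

# A dedicated coordinating node would instead set all three roles to false:
# node.master: false
# node.data: false
# node.ingest: false
```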

The default port for Elasticsearch is 9200. This can be changed by configuring http.port: <port>. In case of clustering, all the nodes in a cluster must have the same cluster name.

As discussed earlier, an index is divided into shards. Each shard can be replicated, and a replica placed on a different node can be promoted to primary if the primary shard fails. The default number of primary shards is 5 and the default number of replicas is 1. That is, if you create an index, 5 primary shards and one replica of each are created by default, making the count 10. This can be changed by configuring index.number_of_shards: <number of primary shards you want> and index.number_of_replicas: <number of replicas you want>.
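Putting the settings named above together, a configuration fragment that gives every new index 3 primary shards and 2 replicas each (9 shards in total, with the shard counts chosen purely for illustration) might look like this:

```yaml
http.port: 9200             # default HTTP port
index.number_of_shards: 3   # primary shards per index
index.number_of_replicas: 2 # replica copies of each primary shard
```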

Figure: a cluster containing 2 primary shards and 2 replica shards.

There are a few other settings that have to be taken care of before going to production. The data folder should be moved to a different location in order to protect the data in case Elasticsearch is reinstalled. This has to be done if you installed from the .zip or .tar.gz archive; the Debian and RPM packages already do it. Configure path.data: /path/to/data and path.logs: /path/to/logs. You can also include more than one data location.
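A sketch of the relevant lines in elasticsearch.yml (the paths are placeholders; pick locations outside the installation directory):

```yaml
path.data: /var/data/elasticsearch   # can also be a comma-separated list of locations
path.logs: /var/log/elasticsearch
```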

To connect the different nodes in a cluster, the network address of each node is published via network.host: <xxx.xxx.xxx.xxx>. Nodes then announce their presence to other nodes with the same cluster name. To avoid accidentally joining some other node that may happen to use the same cluster name, unicast discovery is used, and the hosts allowed to join the cluster are specified in discovery.zen.ping.unicast.hosts: [“host1”, “host2:port”].
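Putting the discovery settings together, a minimal sketch for one node might look like this (the cluster name and addresses are illustrative):

```yaml
cluster.name: my-cluster                  # must match on every node in the cluster
network.host: 192.168.1.10                # address this node publishes
discovery.zen.ping.unicast.hosts: ["192.168.1.11", "192.168.1.12:9300"]
```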

When a cluster has many nodes and is restarted, not all the nodes start at the same time. If the original master does not start early enough, the other nodes elect a new master from the master eligible nodes and start functioning. When the original master later becomes active again, it still assumes it is master and tries to form a cluster with nodes that are already following the newly elected master. This is called the split brain problem and it can lead to data loss. To prevent it, configure discovery.zen.minimum_master_nodes: <(number of master eligible nodes/2)+1>. This is the minimum number of master eligible nodes that must be visible before a master can be elected. The number of nodes expected in the cluster is configured as gateway.expected_nodes: <number of nodes>. Recovery begins only after the number of nodes given in gateway.recover_after_nodes: <number of nodes required before recovery> are active, or after the time given by gateway.recover_after_time: <time>.
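For instance, in a cluster with 3 master eligible nodes, the quorum formula above gives (3/2)+1 = 2, so a sketch of these settings (values chosen for that example cluster) might be:

```yaml
discovery.zen.minimum_master_nodes: 2   # quorum of 3 master eligible nodes
gateway.expected_nodes: 3               # recovery starts immediately once all 3 nodes join
gateway.recover_after_nodes: 2          # never start recovery with fewer than 2 nodes active
gateway.recover_after_time: 5m          # otherwise wait 5 minutes before recovering
```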

These are the most important settings in the configuration file. Other factors to consider before going to production include providing enough heap memory, leaving plenty of RAM for the operating system cache, and using SSDs (or spinning disks in a RAID 0 configuration) for storage.

For further details on Elasticsearch, please refer to the official documentation: https://www.elastic.co/guide/index.html.
