Seeding Elasticsearch with test data using zefaker and esbulk

In this short article, we will see how you can use zefaker to generate 5 million records of random data, which we will then index into Elasticsearch using esbulk.

Prerequisites:

- Java (to run zefaker)
- Go (to build esbulk)
- A running Elasticsearch or OpenSearch instance

Creating the zefaker file

First, we need to create a Groovy script to use with zefaker, describing the shape of our random data. Copy the code snippet below into a file named data.groovy, which we will pass to zefaker to generate our data.

// in data.groovy
import com.google.gson.JsonObject

firstName = column(index= 0, name= "firstName")
lastName  = column(index= 1, name= "lastName")
age       = column(index= 2, name= "age")

accountStatus = column(index=3, name="accountStatus")
accountMeta   = column(index=4, name="accountMeta")

generateFrom([
    (firstName): { faker -> faker.name().firstName() },
    (lastName): { faker -> faker.name().lastName() },
    (age): { faker -> faker.number().numberBetween(18, 70) },
    (accountStatus): { faker -> faker.options().option("Open", "Closed") },
    // You can nest objects like this
    (accountMeta): { faker -> 
        def meta = new JsonObject()
        meta.addProperty("totalTokens", faker.number().numberBetween(5000, 10000))
        meta.addProperty("activityStatus", faker.options().option("Active", "Dormant"))
        return meta
    }
])

Generating the data

zefaker requires Java to run. I'm assuming you have the java command in your PATH. With that in place, we can run the following to generate 5 million rows of random data, exported in the JSON Lines format (a plain-text file where each line is a JSON object).

$ java -jar zefaker-all.jar -f data.groovy -jsonl -output elasticdata.jsonl -rows 5000000
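Each line of elasticdata.jsonl is one self-contained JSON document, which makes the file trivial to stream line by line. Here is a quick Python sketch of how such a file is consumed; the two sample records are made up for illustration, matching the shape our data.groovy script produces:

```python
import json

# Two made-up lines in the shape our data.groovy script produces.
sample_jsonl = '''{"firstName": "Ada", "lastName": "Lovelace", "age": 36, "accountStatus": "Open", "accountMeta": {"totalTokens": 7500, "activityStatus": "Active"}}
{"firstName": "Alan", "lastName": "Turing", "age": 41, "accountStatus": "Closed", "accountMeta": {"totalTokens": 6200, "activityStatus": "Dormant"}}'''

# JSON Lines needs no special parser: decode one line at a time.
records = [json.loads(line) for line in sample_jsonl.splitlines()]
for r in records:
    print(r["firstName"], r["accountStatus"], r["accountMeta"]["activityStatus"])
```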

You can also try the zefaker web instance I have running here; this saves you from having to install Java or zefaker on your machine. Make sure you select JSON Lines as the export option.

Getting esbulk utility

We will use esbulk, a nifty small command-line program written in Go, to perform the indexing. We will have to build it first.

$ git clone https://github.com/miku/esbulk

$ cd esbulk

$ go build

This will create an executable named esbulk (esbulk.exe on Windows). You can add it to your PATH.

Indexing the data in Elasticsearch

Again, it is assumed that you have installed Elasticsearch or OpenSearch and have it running. We can use the following command to index our data:

$ esbulk -index "people-2021.07.07" -optype create -server http://localhost:9200 < elasticdata.jsonl
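esbulk handles the Bulk API plumbing for us; for the curious, the Bulk API takes newline-delimited pairs where each document is preceded by an action line (here "create", matching the -optype flag above). A rough Python sketch of that payload shape, not esbulk's actual implementation:

```python
import json

# Two made-up documents standing in for lines of elasticdata.jsonl.
docs = [
    {"firstName": "Ada", "lastName": "Lovelace", "age": 36},
    {"firstName": "Alan", "lastName": "Turing", "age": 41},
]

# Bulk API body: an action line, then the document, for each record.
lines = []
for doc in docs:
    lines.append(json.dumps({"create": {"_index": "people-2021.07.07"}}))
    lines.append(json.dumps(doc))
payload = "\n".join(lines) + "\n"  # the bulk body must end with a newline

print(payload)
```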

After esbulk completes (silently), you can check that the operation was successful by visiting localhost:9200/people-2021.07.07/_search in your browser or by using curl like so:

$ curl -G http://localhost:9200/people-2021.07.07/_search
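To slice the seeded data further, any standard query DSL request works against the new index. A minimal Python sketch using only the standard library (it assumes the cluster is still on localhost:9200; the actual request is commented out so the snippet is safe to run without a cluster):

```python
import json
from urllib import request

# Find a few documents whose accountStatus is "Open".
query = {"query": {"match": {"accountStatus": "Open"}}, "size": 3}

req = request.Request(
    "http://localhost:9200/people-2021.07.07/_search",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# With Elasticsearch running, uncomment to execute the search:
# with request.urlopen(req) as resp:
#     hits = json.load(resp)["hits"]["hits"]
#     print(len(hits))
```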

And that's all, folks. I hope you found this useful.

Show some love and star zefaker :)