Seeding Elasticsearch with test data using zefaker and esbulk

In this short article, we will see how to use zefaker to generate 5 million records of random data and then index them in Elasticsearch using esbulk.

Prerequisites:

zefaker (version 0.6 at the time of writing)
JDK 15+
Go 1.16+
esbulk
Elasticsearch 7+ / OpenSearch 1.0.0
curl (optional, but really you should have this already)

Creating the `zefaker` file

First, we need to create a Groovy script that tells zefaker what shape our random data should take. Copy the code snippet below into a file named data.groovy, which we will pass to zefaker to generate our data.

// in data.groovy
import com.google.gson.JsonObject

firstName = column(index= 0, name= "firstName")
lastName  = column(index= 1, name= "lastName")
age       = column(index= 2, name= "age")

accountStatus = column(index=3, name="accountStatus")
accountMeta   = column(index=4, name="accountMeta")

generateFrom([
    (firstName): { faker -> faker.name().firstName() },
    (lastName): { faker -> faker.name().lastName() },
    (age): { faker -> faker.number().numberBetween(18, 70) },
    (accountStatus): { faker -> faker.options().option("Open", "Closed") },
    // You can nest objects like this
    (accountMeta): { faker -> 
        def meta = new JsonObject()
        meta.addProperty("totalTokens", faker.number().numberBetween(5000, 10000))
        meta.addProperty("activityStatus", faker.options().option("Active", "Dormant"))
        return meta
    }
])

Generating the data

zefaker needs Java to run, so I'm assuming you have the java command in your PATH. With that in place, we can run the following to generate 5 million rows of random data, exported in JSON Lines format (basically a plain text file where each line is a JSON object).

$ java -jar zefaker-all.jar -f data.groovy -jsonl -output elasticdata.jsonl -rows 5000000

You can also try the zefaker web instance I have running, which saves you from having to install Java or zefaker on your machine. Make sure you select JSON Lines as the export option.
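
Before indexing, it is worth a quick sanity check that the file looks the way we expect. The snippet below is a minimal sketch: count_rows is a helper name I made up for illustration, and it relies on the file having one JSON object per line, each ending in a newline.

```shell
# Hypothetical helper: count the records in a JSON Lines file,
# relying on the one-JSON-object-per-line layout.
count_rows() {
    # wc -l counts newline characters; tr removes the padding some wc builds add
    wc -l < "$1" | tr -d ' '
}

# Show the first record and the total row count, if the file has been generated
if [ -f elasticdata.jsonl ]; then
    head -n 1 elasticdata.jsonl
    count_rows elasticdata.jsonl
fi
```

If every row was written, the count should match the value you passed to -rows.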

Getting the `esbulk` utility

We will use esbulk, a nifty little command-line program written in Go, to perform the indexing. We will have to build it first.

$ git clone https://github.com/miku/esbulk

$ cd esbulk

$ go build

This will create an executable named esbulk (esbulk.exe on Windows). You can add it to your PATH.
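
One way to do that on Linux or macOS, assuming you are still inside the esbulk directory from the build step, is sketched below. It only changes the current shell session; to make it permanent you would add the export line to your shell profile.

```shell
# Append the current directory (assumed to be the esbulk checkout) to PATH.
# This only affects the current shell session.
export PATH="$PATH:$PWD"

# Confirm the shell can now find the binary
command -v esbulk || echo "esbulk not found on PATH"
```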

Indexing the data in Elasticsearch

Again, it is assumed that you have installed Elasticsearch or OpenSearch and have it running. We can use the following command to index our data:

$ esbulk -index "people-2021.07.07" -optype create -server http://localhost:9200 < elasticdata.jsonl

After esbulk completes (silently), you can check that the operation was successful by visiting http://localhost:9200/people-2021.07.07/_search in your browser or using curl like so:

$ curl -G http://localhost:9200/people-2021.07.07/_search
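
The _search endpoint only returns a handful of documents by default, so to confirm that all the rows made it in you can also ask the _count endpoint for the total. Here is a small sketch; count_url and count_docs are helper names I made up, and the server and index are the same ones used above.

```shell
# Hypothetical helpers; the server URL and index name are passed in as arguments.

# Build the URL of the _count endpoint for an index
count_url() {
    printf '%s/%s/_count' "$1" "$2"
}

# Ask the server how many documents the index holds (-s hides curl's progress meter)
count_docs() {
    curl -s "$(count_url "$1" "$2")"
}
```

Running count_docs http://localhost:9200 people-2021.07.07 should print a JSON object with a count field. If everything was indexed, that number should match the number of rows you generated, though it may lag for a moment until the index refreshes.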

And, that's all folks. Hope you found this useful.

Show some love and star zefaker :)
