Indexing in Bulk

ElasticsearchBook is crafted by Jozef Sorocin.
Once you've established a solid mapping, you'll want to index multiple documents at once using the Bulk API. The _bulk endpoint expects its payload as newline-delimited JSON (ndjson). Since this format is verbose and easy to get wrong, it's helpful to use the client libraries' helpers instead. Nonetheless, we'll cover the ndjson format too in case you don't plan on using a client library.
 
💡
There's no universally "optimal" chunk size because it depends on factors such as document size and cluster resources, but a good starting point is 1,000 documents or 5 MB per request.
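To make that starting point concrete, here's a minimal sketch that splits a list of documents into batches capped at 1,000 documents or roughly 5 MB of serialized JSON, whichever limit is hit first. The helper name chunk_docs and its parameters are hypothetical and not part of any Elasticsearch client. The byte count ignores the action lines of the bulk payload, so treat it as an approximation.

```python
import json


def chunk_docs(docs, max_docs=1000, max_bytes=5 * 1024 * 1024):
    """Yield lists of docs, each capped by document count and approximate size."""
    batch, batch_bytes = [], 0
    for doc in docs:
        # size of the serialized document plus its trailing newline
        doc_bytes = len(json.dumps(doc).encode('utf-8')) + 1
        # start a new batch if adding this doc would exceed either limit
        if batch and (len(batch) >= max_docs or batch_bytes + doc_bytes > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += doc_bytes
    if batch:
        yield batch


# made-up sample documents, for illustration only
docs = [{'title': f'doc {i}'} for i in range(2500)]
print([len(batch) for batch in chunk_docs(docs)])
```

Each batch can then be turned into its own bulk request, and both limits can be tuned once you've measured how your cluster behaves.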
 

Use Case 1: JSON Files in Python

How do I index a JSON file consisting of an array of objects in Python?

Approach

Using no Python client library, only requests
import requests
import json

index_name = 'my_index'
# the index name goes into the URL path, before `_bulk`;
# alternatively, set `_index` on each action line further down
endpoint = 'http://localhost:9200/' + index_name + '/_bulk'
data = []

with open('file.json', 'r') as json_in:
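    docs = json.load(json_in)
    for doc in docs:
        # `_id` is a metadata field, so pop it out of the document body;
        # if it's blank or left out, the ID will be auto-generated
        doc_id = doc.pop('_id', None)
        data.append({'index': {'_id': doc_id} if doc_id else {}})
        data.append(doc)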

# rudimentary conversion into ndjson
payload = '\n'.join([json.dumps(line) for line in data]) + '\n'

r = requests.post(endpoint,
                  # `data` instead of `json`: the payload is already serialized
                  data=payload,
                  headers={
                      # the Content-Type header is required
                      'Content-Type': 'application/x-ndjson'
                  })

print(r.json())
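The Bulk API can return HTTP 200 even when individual actions fail, so it's worth inspecting the response body rather than relying on the status code alone. The sketch below assumes the documented response shape: a top-level errors flag plus an items list keyed by action type. The name failed_items is a hypothetical helper, and the sample response is hand-written for illustration rather than captured from a real cluster.

```python
def failed_items(response):
    """Return (position, action, error) tuples for every failed bulk action."""
    if not response.get('errors'):
        return []
    failures = []
    for position, item in enumerate(response.get('items', [])):
        # each item has a single key naming the action, e.g. 'index'
        for action, result in item.items():
            if 'error' in result:
                failures.append((position, action, result['error']))
    return failures


# hand-written sample mimicking the shape of a `_bulk` response
sample_response = {
    'took': 30,
    'errors': True,
    'items': [
        {'index': {'_index': 'my_index', '_id': '1', 'status': 201}},
        {'index': {'_index': 'my_index', '_id': '2', 'status': 400,
                   'error': {'type': 'mapper_parsing_exception',
                             'reason': 'failed to parse'}}},
    ],
}

for position, action, error in failed_items(sample_response):
    print(position, action, error['type'])
```

In the script above, this could be called as failed_items(r.json()) in place of the bare print.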
Using the elasticsearch-py library
import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch('http://localhost:9200')
index_name = 'my_index'

with open('file.json', 'r') as json_in:
    data = json.load(json_in)

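actions = [
    {
        '_index': index_name,
        '_id': doc.get('_id'),  # if None, the ID will be auto-generated
        # `_id` is a metadata field, so keep it out of the document body
        '_source': {k: v for k, v in doc.items() if k != '_id'}
    } for doc in data
]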

bulk(es, actions)
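For larger files, building the whole list of actions in memory may be wasteful. Since the bulk helper accepts any iterable, the actions can also be produced lazily by a generator. The sketch below is only an illustration: generate_actions is a hypothetical name, the sample documents are made up, and the client call is left as a comment so the snippet runs without a cluster.

```python
def generate_actions(docs, index_name):
    """Lazily yield one bulk action per document."""
    for doc in docs:
        action = {
            '_index': index_name,
            # keep the `_id` metadata field out of the document body
            '_source': {k: v for k, v in doc.items() if k != '_id'},
        }
        # only set `_id` when the document carries one;
        # otherwise it's left for Elasticsearch to auto-generate
        if doc.get('_id') is not None:
            action['_id'] = doc['_id']
        yield action


# made-up sample documents, for illustration only
docs = [{'_id': 'a1', 'title': 'first'}, {'title': 'second'}]
for action in generate_actions(docs, 'my_index'):
    print(action)

# with a live cluster and the `es` client from above, this could be passed on as:
# success_count, errors = bulk(es, generate_actions(docs, 'my_index'), chunk_size=1000)
```

The chunk_size argument controls how many actions the helper sends per request, which ties back to the 1,000-document starting point mentioned earlier.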
 
