This is a rough transcript of a talk I gave at Jekyll Conf 2016, Elasticsearch for Jekyll.

Hi! I’m Alli. I’m an engineer and designer at Bonsai.io. Bonsai is a popular hosted Elasticsearch service that people can use as a vanilla direct service or through the Heroku addons marketplace. The umbrella company, One More Cloud, has been around since 2009, when we launched our first service, Websolr, which is hosted Solr. We were the first to launch hosted Solr as a service in 2009, and the first to launch hosted Elasticsearch as a service in February of 2012. We’ve been completely bootstrapped since day one, and we’re a small team of 8 spread all over the states (and sometimes Israel, Bali, and Spain).

I spend most of my time in our Rails and Jekyll projects, and write our front-end stuff (which is mostly React these days). I’m excited to be here and talk about how to integrate Elasticsearch and Jekyll using our new gem Searchyll.

Searchyll was developed recently by three people on the Bonsai product team: our founder Nick Zadrozny, Rob Sears, and myself. It’s designed to send data to an Elasticsearch cluster so it can later be searched.

What is Elasticsearch?

Elasticsearch, at its core, is an open-source search engine. Just like Solr, it is based on Apache’s Lucene search library. Elasticsearch gives developers access to an enterprise-grade engine that is distributed, scalable, and has a rich API. It is incredibly easy to get started with, devilishly hard to host, and very very popular right now.

There are two main parts of using Elasticsearch: sending it data to create a Lucene inverted index, and searching that data using the API’s _search endpoint. Basic query requests look something like this:

image
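In case the screenshot doesn’t come through in this transcript, here is a rough sketch in Ruby of what a basic request to the _search endpoint can look like. The cluster URL, the "jekyll" index name, and the "text" field are assumptions for illustration, and the request is only built here, not sent.

```ruby
require "json"
require "net/http"
require "uri"

# Hypothetical read-only cluster URL and index name, for illustration only.
CLUSTER_URL = "https://example-cluster.example.com"
INDEX_NAME  = "jekyll"

# Build (but do not send) a POST request to the index's _search endpoint.
def build_search_request(terms)
  uri = URI.join(CLUSTER_URL, "/#{INDEX_NAME}/_search")
  request = Net::HTTP::Post.new(uri)
  request["Content-Type"] = "application/json"
  # A simple full-text match query against an assumed "text" field.
  request.body = JSON.generate({ query: { match: { text: terms } } })
  request
end

request = build_search_request("webpack react")
puts "#{request.method} #{request.path}"
puts request.body
```

To actually run the search you would hand the request to Net::HTTP (or use curl, or a client-side library); the response comes back as JSON with the matching documents under "hits".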

So, Why Elasticsearch & Jekyll?

I think it feels unintuitive that we would be able to integrate Elasticsearch with Jekyll. At first glance it seems like the kind of tool that needs a lot more support than a static generator can give. However, this is moot: Elasticsearch is its own webserver. It speaks HTTP, so we can send data from anywhere, and we can simply use a client-side framework to allow users to interact with it. Additionally, if you run the indexing process locally you can keep your site on GitHub Pages, which is way cool.

Getting Started

There are four things you need to get this to work:

  1. A full-access URL to an Elasticsearch cluster. This is for sending data.
  2. A public read-only URL to that cluster, for client-side searching.
    • You can grab these URLs by signing up for an Elasticsearch hosting provider. Of course, I would say use Bonsai because we’re the best 😆! We have a free developer Sandbox plan we’re about to roll out for direct customers in a couple of weeks - it works similarly to Heroku’s free dynos. If you ping me at allison@bonsai.io and just mention you were at Jekyll Conf, I can whitelist you to be able to use the developer plan and our public read-only URL feature.
  3. A Jekyll project with posts! I used my blog which is now using Elasticsearch in production - check it out with the searchbox in the navigation above!
    • Make sure that any part of a post that you want to make searchable is wrapped in an <article/> tag. I wanted to ensure that there was some control over what was full-text-searchable. You can do this by wrapping an <article/> tag around the {{ content }} in your post layout.
  4. A JavaScript framework for client-side interaction via HTTP. I’m using React and Searchkit.

Setting up your project for Searchyll

Add the gem to your Gemfile and config.

Gemfile:

gem "searchyll"

_config.yml:

gems:
  - searchyll

elasticsearch:
  url: ELASTICSEARCH_FULL_ACCESS_URL        # Optional. Depends on use case*
  number_of_shards: 1                       # Optional.
  number_of_replicas: 1                     # Optional.
  index_name: "jekyll"                      # Optional.
  default_type: "post"                      # Optional.

Again, make sure your searchable content is in an <article/> tag.

  * The url value above is sensitive, so be careful not to push it to a public repo. That said, with Bonsai you can rotate your credentials if something happens. (I accidentally pushed my URL with credentials one late night while working on this project, and was able to rotate them on the spot once I realized my error.) Check with your local ops engineer on how to manage this secret given your deploy context. I’m actually using AWS to host my personal site, and even though it lives in a private repo on GitHub, I don’t commit the URL; I pass it in as an ENV variable when running my deploy script. It’s still super useful to store it in your config during development.
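Here is a minimal sketch of that "ENV variable first, config second" idea. How Searchyll itself resolves the URL isn’t shown in this post, so both helpers below (elasticsearch_url and redact) are hypothetical names used purely for illustration.

```ruby
# Hypothetical helper: prefer the BONSAI_URL environment variable, then fall
# back to the elasticsearch.url value from the site config (handy in development).
def elasticsearch_url(config, env = ENV)
  url = env["BONSAI_URL"] || config.dig("elasticsearch", "url")
  raise ArgumentError, "No Elasticsearch URL configured" if url.nil? || url.empty?
  url
end

# Hypothetical helper: mask any user:password part so the URL is safe to log.
def redact(url)
  url.sub(%r{//[^/@]+@}, "//***:***@")
end

# Example with a config hash shaped like the YAML above.
config = { "elasticsearch" => { "url" => "http://localhost:9200" } }
puts redact(elasticsearch_url(config))
```

Keeping the lookup in one place means the full-access URL never has to live in a committed file, and anything you print during a deploy goes through redact first.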

Running the indexing process

Searchyll indexes on build, so you can index to your cluster locally by running:

  $ BONSAI_URL="https://user_name:password@trial-jekyll-1468587631.us-east-1.bonsai.io" jekyll build

Or, if you’re going to use it in your deploy:

  $ BONSAI_URL="https://user_name:password@trial-jekyll-1468587631.us-east-1.bonsai.io" bin/deploy

Under the Hood

The Searchyll gem works by running an Indexer on a Jekyll Hook during the :post_render phase. It grabs the full HTML document, strips the HTML to pull the text out of the article tag, and sends it off to Elasticsearch as “text” in an object. Also in that object are the YAML front matter, post data, and the document id (the post’s permalink):

Jekyll::Hooks.register :posts, :post_render do |post|
  # strip html
  nokogiri_doc = Nokogiri::HTML(post.output)

  indexer = indexers[post.site]
  indexer << post.data.merge({
    id:   post.id,
    text: nokogiri_doc.xpath("//article//text()").to_s.gsub(/\s+/, " ")
  })
end

source: https://github.com/omc/searchyll/blob/master/lib/searchyll.rb

I took a screenshot of our logs while indexing recently to show the steps the Searchyll Indexer is going through:

image

Cool things Searchyll does:

  • Follows best practices to ensure no downtime if you’re reindexing (uses an alias to hot reindex).
  • Indexes in batches using the _bulk endpoint, so you don’t run into concurrency limits.
  • Allows you to customize your shard and replication settings, among other index settings. Not sure what shards and replicas are? Check out our docs: What are Shards and Replicas?
  • Works with GitHub Pages - given a small change.
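To make the first two bullets a bit more concrete, here is a rough sketch of the request bodies involved. It is not Searchyll’s actual code: the helper names and the dated index names are assumptions, and the _type field reflects the Elasticsearch versions of the time (newer versions drop it).

```ruby
require "json"

# Build a newline-delimited _bulk body: one action line plus one source line
# per document, with a trailing newline as the endpoint expects.
def bulk_body(docs, index:, type: "post")
  lines = docs.flat_map do |doc|
    action = { index: { _index: index, _type: type, _id: doc[:id] } }
    [JSON.generate(action), JSON.generate(doc)]
  end
  lines.join("\n") + "\n"
end

# Body for POST /_aliases: repoint the alias from the old index to the new
# one in a single call, so searches against the alias never see a gap.
def alias_swap(alias_name, old_index, new_index)
  actions = []
  actions << { remove: { index: old_index, alias: alias_name } } if old_index
  actions << { add: { index: new_index, alias: alias_name } }
  { actions: actions }
end

docs = [
  { id: "/2016/05/01/hello", title: "Hello", text: "Hello world" },
  { id: "/2016/05/02/again", title: "Again", text: "Hello again" },
  { id: "/2016/05/03/third", title: "Third", text: "One more post" }
]

# Send documents two at a time instead of one request per post.
docs.each_slice(2) { |batch| puts bulk_body(batch, index: "jekyll-20160503") }
puts JSON.generate(alias_swap("jekyll", "jekyll-20160502", "jekyll-20160503"))
```

The idea is that a reindex writes into a fresh index while the alias still points at the old one, and the alias only moves once the new index is fully populated.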

Deploying for GitHub Pages

Deploying to GitHub Pages while using Searchyll requires one workaround. GitHub Pages builds your site for you and only runs its own approved plugins, so you can’t run the indexing as part of the deploy. Instead, index locally by running a build, then remove the gem and its related config values and push to GitHub Pages.

Searching your Posts

Searching your posts with client-side JavaScript is a personal thing, and probably the hardest bit of all of this. My experience thus far with Searchkit has been great; their documentation is fairly thorough. Going into detail about Searchkit would be out of scope here, but I’ve written a blog post about how to integrate Jekyll with webpack and React: Using Webpack and React with Jekyll. It’s a good foundation for building out Searchkit components for your Elasticsearch cluster. The search works pretty beautifully. There are a few UI kinks I’m still trying to iron out, since it’s my very first time working with Searchkit, but getting started was incredibly easy.

Here are just a couple of cool things I’ve built so far:

1. search highlighting, AND, OR, and NOT:

image

my Searchkit hit results component:

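// Note: the dangerouslySetInnerHTML values were lost when this page was rendered
// (the double curly braces got stripped). The two values below are a
// reconstruction and an assumption: the post title, and the highlighted
// fragments of the "text" field that Searchyll indexes.
const HitItem = (props) => (
  <div className={props.bemBlocks.item().mix(props.bemBlocks.container("item"))}>
    <a href={ props.result._source.id + "/" }>
      <div className={props.bemBlocks.item("title")} dangerouslySetInnerHTML={{ __html: props.result._source.title }}></div>
    </a>
    <div><small className={props.bemBlocks.item("hightlights")} dangerouslySetInnerHTML={{ __html: ((props.result.highlight || {}).text || []).join(" ") }}></small></div>
  </div>
)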

2. Filtering results based on front matter tags or categories:

Since we’re sending every post’s full data to Elasticsearch, we can create some search filtering based on tags you define in the front matter (I could see someone using categories easily, too):

---
layout: post
title:  Using Webpack and React with Jekyll
categories: code
comments: true
summary: How to use webpack with Jekyll so you can use npm modules, such as react or searchkit.
tags:
  - code
---
# md

my Searchkit Filter component:

  ...
  render() {
    return(
      <div>

      ...

        <RefinementListFilter field="tags" title="Tags" id="tags"/>

      ...

      </div>
    )
  }

Turns into this:

image
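Searchkit builds its own requests, so you never write this by hand, but it can help to see roughly what kind of query sits behind a tag filter. The sketch below is an assumption on my part rather than Searchkit’s exact output: a full-text match on "text", narrowed to the selected tags, plus a terms aggregation that supplies the counts for the filter list. Depending on your mapping, the "tags" field may need to be indexed as exact (not analyzed) values for this to behave well.

```ruby
require "json"

# Hypothetical helper: build a search body that matches on "text",
# optionally filters by tags, and asks for per-tag counts.
def tag_filtered_query(terms, selected_tags = [])
  bool = { must: [{ match: { text: terms } }] }
  unless selected_tags.empty?
    # One term filter per selected tag; every selected tag must be present.
    bool[:filter] = selected_tags.map { |tag| { term: { tags: tag } } }
  end
  {
    query: { bool: bool },
    aggs: { tags: { terms: { field: "tags" } } }
  }
end

puts JSON.pretty_generate(tag_filtered_query("webpack", ["code"]))
```

The aggregation part is what lets the UI show each tag alongside the number of matching posts.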

In Closing

I’m really excited about this little gem and its capabilities, especially since using it with Searchkit just makes development super easy and fun. Moreover, it doesn’t cause the classic “host it yourself” deploy problem. This was a priority and I’m happy that there’s an easy workaround for that.

Fork the project, use it in a custom way, or submit a pull request or issue. And, if you ever need help setting this up for your own site, feel free to ping me here, or at allison@bonsai.io or Twitter at @allizad. Don’t forget to send me a message if you’re interested in our free dev Sandbox cluster and using it with a public read-only URL.

Leave questions and comments below, and share what you build! :)