A New Way to Handle Synonyms in a Search Engine

We recently added the support for Synonyms in Algolia! It has been the most requested feature in Algolia since our launch in September. While it may seem simple, it actually took us some time to implement because we wanted to do it in a different way than classic search engines.

What's wrong with synonyms

There are two main problems with how existing search engines handle synonyms. These issues disturb the user experience and could make them think "this search engine is buggy".

Typeahead

In most search engines, synonyms are not compatible with typeahead search. For example, if you want tablet  to equal  ipad in a query, the prefix search for t , ta , tab , tabl  & table  will not trigger the expansion on iPad ; Only the tablet query will. Thus, a single new letter in the search bar could totally change the result set, catching users off-guard.

Highlighting

Highlighting matched text is a key element of the user experience, especially when the search engine tolerates typos. This is the difference between making users think "I don't understand this result" and "This engine was able to understand my errors". Synonym expansions are rarely highlighted, which breaks the trust of the users in the search results and can feel like a bug.

Our implementation

We have identified two different use cases for synonyms: equalities and placeholders. The first and most common use case is when you tell the search engine that several words must be considered equal, for example st and street in an address. The second use case, which we call a placeholder, is when you indicate that a specific token can be replaced by a set of possible words and that the token itself is not searchable. For example, the content  street could be matched by the queries 1st street or 2nd street but not the query number street.

For the first use case, we have added a support of synonyms that is compatible with prefix search and have implemented two different ways to do highlighting (controlled by thereplaceSynonymsInHighlight  query parameter):

  1. A mode where the original word that matched via a synonym is highlighted. For example if you have a record that contains black ipad 64GB  and a synonym black equals dark, then the following queries will fully highlight the black word : ipad d , ipad da , ipad dar & ipad dark. The typeahead search is working and the synonym expansion is fully highlighted: **black** **ipad** 64GB .
  2. A mode where the original word is replaced by the synonym, and the matched prefix is highlighted. For example ipad d  query will replace black by dark and will highlight the first letter of dark: **d**ark **ipad** 64GB. This method allows to fully explain the results when the original word can be safely replaced by the matched synonym.

For the second use case, we have added support for placeholders. You can add a specific token in your records that will be safely replaced by a set of words defined in your configuration. The highlighting mode that replaces the original word by the expansion totally makes sense here. For example if you have mission street  record with a placeholder = [ "1st", "2nd", ....] , then the query 1st missionstreet will replace by 1st  and will highlight all words: **1st mission street**.

We believe this is a better way to handle synonyms and we hope you will like it :) We would love to get your feedback and ideas for improvement on this feature! Feel free to contact us at hey(at)algolia.com.

Why JSONP is still Mandatory

At Algolia, we are convinced that search queries need to be sent directly from the browser (or mobile app) to the search-engine in order to have a realtime search experience. This is why we have developed a search backend that replies within a few milliseconds through an API that handles security when called from the browser.

Cross domain requests

For security reasons, the default behavior of a web browser is to block all queries that are going to a domain that is different from the website they are sent from. So when using an external HTTP-based search API, all your queries should be blocked because they are sent to an external domain. There are two methods to call an external API from the browser:

 JSONP

The JSONP approach is a workaround that consists of calling an external API  with a DOM <script>  tag. The <script> tag is allowed to load content from any domains without security restrictions. The targeted API needs to expose a HTTP GET endpoint and return Javascript code instead of the regular JSON data. You can use this jQuery code to dynamically call a JSONP URL:

$.getJSON( "http://api.algolia.io/1/indexes/users?query=test", function( data ) { .... }

In order to retrieve the API answer from the newly included JavaScript code, jQuery automatically appends a callback argument to your URL (for example &callback=method12 ) which must be called by the JavaScript code that your API generates.

This is what a regular JSON reply would look like:

{
  "results": [ ...]
}

Instead, the JSONP-compliant API generates:

method12({"results": [ ...]});

Cross Origin Resource Sharing

CORS (Cross Origin Resource Sharing) is the proper approach to perform a call to an external domain. If the remote API is CORS-compliant, you can use a regular XMLHttpRequest  JavaScript object to perform the API call. In practice the browser will first perform an HTTP OPTIONS request to the remote API to check which caller domains are allowed and if it is authorized to execute the requested URL.

For example here is a CORS request issued by a browser. The most important lines are the last two headers that specify which permissions are checked. In this case, the method is POST and the three specific HTTP headers that are requested.

OPTIONS http://latency.algolia.io/1/indexes/*/queries
> Host: latency.algolia.io
> Origin: http://demos.algolia.com
> Accept-Encoding: gzip,deflate,sdch
> Accept-Language: en-US,en;q=0.8,fr;q=0.6
> User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)
> Accept: */*
> Referer: http://demos.algolia.com/eventbrite/
> Connection: keep-alive
> Access-Control-Request-Headers: x-algolia-api-key, x-algolia-application-id, content-type
> Access-Control-Request-Method: POST

The server reply will be similar to this one:

< HTTP/1.1 200 OK
< Server: nginx/1.6.0
< Date: Tue, 13 May 2014 08:33:55 GMT
< Content-Type: text/plain
< Content-Length: 0
< Connection: keep-alive
< Access-Control-Allow-Origin: *
< Access-Control-Allow-Methods: GET, PUT, DELETE, POST, OPTIONS
< Access-Control-Allow-Headers: x-algolia-api-key, x-algolia-application-id, content-type
< Access-Control-Allow-Credentials: false < Expires: Wed, 14 May 2014 08:33:55 GMT
< Cache-Control: max-age=86400
< Access-Control-Max-Age: 86400

This answer indicates that this POST method can be called from any domain (Access-Control-Allow-Origin: * ) and with the requested headers.

CORS has many advantages. First, it allows access to a real REST API with all HTTP verbs (mainly GET, POST, PUT, DELETE) and it also allows to better handle errors in an API (bad requests, object not found, ...). The major drawback is that it is only supported by modern browsers (Internet Explorer ≥ 10, Firefox ≥ 3.5, Chrome ≥ 3, Safari ≥ 4 & Opera ≥ 12; Internet Explorer 8 & 9 provides partial support via theXDomainRequest  object).

Our initial conclusion

Because of the advantages of CORS in terms of error handling, we started with a CORS implementation of our API. We also added a specific support for Internet Explorer 8 & 9 using the  XDomainRequest  JavaScript object (they do not support XMLHttpRequest). The main difference is that XDomainRequest  does not support HTTP headers so we added another way to specify user credentials in the body of the POST request (it was initially only supported via HTTP headers).

We were confident that we were supporting almost all browsers with this implementation, as only very old browsers could cause problems. But we were wrong!

CORS problems

The reality is that CORS still causes problems, even with modern browsers. The biggest problem we have found was with some firewalls/proxies that refuse HTTP OPTIONS queries. We even found software on some computers that were blocking CORS requests, as the Cisco AnyConnect VPN client, which is widely used in the enterprise world. We have found this issue when a TechCrunch employee was not able to operate search on crunchbase.com because the AnyConnect VPN client was installed on his laptop.

Even in 2014 with a large majority of browsers supporting CORS, it is not possible to have perfect service quality with a CORS-enabled REST API!

The solution

Using JSONP is the only solution to ensure great compatibility with old browsers and handle problems with a misconfigured firewall/proxy. However, CORS offers the advantage of proper error-handling, so we do not want to limit ourselves to JSONP.

In the latest version of our JavaScript client, we decided to use CORS with a fallback on JSONP. At client initialization time, we check if the browser supports CORS and then perform an OPTIONS query to check that there is no firewall/proxy that blocks CORS requests. If there is any error we fallback on JSONP. All this logic is available in our JavaScript client without any API/code change for our customers.

Having CORS support with automatic fallback on JSONP is the best way we have found to ensure great service quality and to support all corner case scenarios. If you see any other way to do it, your feedback is very welcome.

Inside GrowthHackers.com's Implementation of Algolia

We interviewed Dylan La Com, Growth Product Manager at Qualaroo & GrowthHackers.com, about their Algolia implementation experience.

growthacker

What role did search play at GrowthHackers before the Algolia

implementation?

When we launched our community site GrowthHackers.com in October 2013, search was admittedly an afterthought for us. GrowthHackers is a social-voting site where marketers, founders, and product-people can share and discuss growth-related content. At launch, it was unclear what role search would have on the site. GrowthHackers is built on Wordpress, and with that comes Wordpress' standard search functionality. What WP search does is append an additional keyword or phrase parameter to its typical post query and load a new page with the results. WP search only indexed the outbound URLs of the articles our members submitted, and this made finding specific content difficult.

Why did you want to give search an update on GrowthHackers?

We started hearing about our lack of a solid search feature from some of our more active users. One of our members even put together a slide presentation to prove just how useless our search was [check it out here]. At the same time, GrowthHackers was becoming more than just a way to stay up-to-date on the best growth articles, it was becoming the place to get answers: an encyclopedia for growth-related information. Search volume at this time was peaking in the mid-hundreds per week. We needed a search feature that could support this evolving use-case.

Why did you choose Algolia?

We looked at several search solutions before trying Algolia, including Swiftype, WP Search (plugin), and Srch2. All are great solutions, but ultimately, we went with Algolia because they had the right mix of features: Their integration was simple, the documentation was thorough, and there were plenty of starter templates. I knew it was a good sign when, while looking their GitHub repository, I found they had a demo site built with search that worked very similar to how we hoped ours would work, complete with real-time results, typo-tolerance, and filters. The Algolia team was incredibly helpful getting us set up and was there each step of the way through the integration process, providing resources and best practices for creating a truly top-notch search experience.

Tell me a little about how the new search works.

Our primary use of Algolia is to store and index user submitted content, and provide real-time search in our growing database of growth-related articles, questions, videos and slides. The majority of what we index is article titles and URLs--strings which are generally small. Visitors to our site often come with specific growth-related questions and use our search to find answers quickly. For example, someone interested in learning best practices for running Twitter ads could type in "Twitter ad" and within milliseconds see dozens of articles and discussions related to maximizing ROI for Twitter ads. Using Algolia's admin dashboard, we're able to set ranking priorities based on the number of votes and comments of each article, and make sure the top results are the most relevant. So, the visitor who searches "Twitter ad" is shown articles with the highest mix of votes and comments. Algolia took the search ranking process and wrapped it in a clean and simple interface that allows anyone, regardless of their experience with search, to easily adjust and manipulate.

One of the challenges we faced during the integration process was understanding how to keep our main database synced and up to date with our Algolia index. User submitted content on GrowthHackers changes often as users interact with the content. Each post once submitted may receive upvotes and comments from members in the community. Each post also has a wiki-style summary field that can be edited by community members. Lastly, posts can have several states, including published, pending and trashed. In order to ensure our content on Algolia mirrored the content in our database, we set up a job queue and a cron process to periodically push updates to our Algolia index. This has been working quite well for us.

How has the new search impacted engagement?

We released the new search mid-February, and since the release we've seen search volume increase 4-5X. Of course there are several factors at play here, including increased traffic volume and better search bar placement, but it is clear that Algolia's search features have contributed to an impressive increase in search engagement. On average, visitors who utilize search view 2-3X more pages per session and spend 5-6X longer on the site than those who don't search. Algolia's analytics dashboard provides us with an incredible glimpse of visitor intent on our site by showing us the queries visitors are searching for, and trend lines to show popularity over time. With this data, we're able to better understand how our visitors want to use our site, and make better decisions about how to organize the content.

Moving forward, we're hoping to implement Algolia's search filters to provide even better ways to access content on our site. We're excited to have such a powerful tool in our stack and hope to experiment with new ways to provide search functionality throughout GrowthHackers.

Dealing with OpenSSL Heartbleed Vulnerability

Yesterday, the OpenSSL project released an update to fix a serious security issue. This vulnerability was disclosed in CVE-2014-0160 and is more widely known as the Heartbleed vulnerability. It allows an attacker to grab the content in memory on a server. Given the widespread use of OpenSSL and the versions affected, this vulnerability affects a large percentage of services on the internet.

Once the exploit was revealed, we responded immediately: All Algolia services were secured the same day, by 3pm PDT on Monday, April 7th. The fix was applied on all our API servers and our website. We then generated new SSL certificates with a new private key.

Our website is also dependent on Amazon Elastic Load Balance, which was affected by this issue and updated later on Tuesday, April 8th. We then changed the website certificate.

All Algolia servers are no longer exposed to this vulnerability.

Your credentials

We took the time to analyze the past activity on our servers and did not find any suspicious activity. We are confident that no credentials were leaked. However, given that this exploit existed in the wild for such a long time, it is possible that an attacker could have stolen API keys or passwords without our knowledge. As a result, we recommend that all Algolia users change the passwords on their accounts. We also recommend that you reset your Algolia administration API key, which you can do at the bottom of the "Credential" section in your dashboard. Be careful to update it everywhere you use it in your code (once you have patched your SSL library if you too are vulnerable).

Security at Algolia

The safety and security of our customer data are our highest priorities. We are continuing to monitor the situation and will respond rapidly to any other potential threats that may be discovered.

If you have any questions or concerns, please email us directly at security@algolia.com

Introducing Search Analytics: Know Your Users Better

This week we have released a much requested feature by our customers: analytics.

The importance of analytics to search

At Algolia, our goal is to revolutionize the way people search and access content inside the Web and mobile services. Think about Spotify, LinkedIn, Amazon: Everyone wants to find the right songs, people and products in just a couple keystrokes. Our challenge is to provide fast and meaningful access to all of this content via a simple search box. In March, we answered more than 200 million user queries for our customers on every continent.

Providing the right content through the right search and browsing experience is key. For our customers, understanding their users - what they like, what they want and when they want it -  is just as important, if not more. This is why we came up with this new analytics section, built on top of our API and available on our customers' online dashboards when they log in to their Algolia account. So what exactly do we track for you?

We describe here some of the top features that are now available to all our users.

Most popular queries

In this chart, we show which items were most queried. It would be useful, for example, to a procurement department for anticipating their  most frequently- searched products' inventory needs. And if you monetize your service through advertising, know what people are most interested in is especially valuable.

A new analytics feature supports the most popular queries.

Queries with no or a few results

Today, most services are simply clueless when it comes to what is missing in their content base. How do you know that your catalogue of products fits your users' expectations? Knowing whether or not you provide what your users need is critical for your business.

Search Analytics: Track top queries Algolia lets you determine which top queries have few or nonexistent results.

How does a query evolve over time?

Is Chanel more popular than Louis Vuitton in the morning or at night? Are bikes more popular in June or in December? With this new feature, you can now answer such questions for your own content by following the number of times a specific query is typed on an hourly basis.

Search Analytics: Track
popularity of a search query over
time Example: Search analytics lets you track the evolution of the query "louboutin" over 24 hours.

Which categories do people search the most?

When users type in a query, they often use categories to refine the results. We let you know which categories were the most frequently used for refinement. We even provide the most used combinations of categories (such as "dress" + "blue" + "size M"). It should help you understand how your users browse your content and has broader implications if the ergonomics of your app is optimized.

Search Analytics: Top categories used for filtering an
refinement Track which combinations of categories people search for the most.

These new analytics features are included in our existing plans at no extra cost. The number of days when our analytics tools are available vary based on the plan you choose. We hope you will like it, and we will be more than happy to read your feedback and feature requests!

Search