A synonym by any other name from alt. labels to knowledge graphs gas vs diesel towing

###########

At the heart of the problem, the a synonym in your ‘synonyms.txt’ isn’t the same as a linguistic synonym. It’s better to think of the search engine’s synonym filter as a relevance engineer’s tool. A way of generating additional terms when another term is encountered. Such a tool can be leveraged to solve a variety of problems, and shouldn’t be used blindly.

From client conversations, what people mean when they say ‘my synonyms are broken’ usually means the search engine’s synonyms don’t match the use case they have in their head. In this article, we’ll enumerate the most common use cases in the field. electricity facts ks2 In future articles, we’ll discuss how to solve each use case using the search engine’s functionality. An Overview of Common Synonym Use Cases

In this article, we’ll define each of these. You’ll notice the nested nature of this diagram. Each inner use case is a simpler version of the outer use case. As you move out, complexity, sophistication, and power increase. The exception are shared contexts (such as with word2vec). As you’ll see terms can share contexts for many reasons unrelated to shared meaning. In these cases, the goal is to munge the data/process to approximate one of the other use cases (such as synonymy).

In the case of alternate terms, we want terms to receive exactly the same in scoring. A document containing ‘color’ is just as much about the search term ‘colour’’. As you might guess, alternate terms is the easiest use case. The default Elasticsearch/Solr synonym functionality ( SynonymQuery ) works pretty well for this use case. electricity facts label Synonyms (in the linguistic sense…)

To most, ‘palace’ has a different connotation than ‘castle’. Though we humans see them as ‘nearly the same meaning’. ‘Near’ depends on the search corpus, domain, user, and use cases. A historian sees a castle as a defensive structure, synonomous with a fortress. electricity outage sacramento To the historian, this is quite different from a palace, which is a fancy home for nobility. A tourist, however, sees them as nearly the same kind of place to visit. Maybe not exactly the same, but pretty close to interchangeable.

Many search teams throw these terms in, don’t think too deeply about how the search engine should work. Then they’re surprised that the search engine doesn’t prioritize them they way they think they should be prioritized. And sadly, every human being and corpus has subtly different senses of how/where these terms should be treated as more similar or less. Expected Search Engine Scoring of Synonyms

As we’ll see in a future blog article, this fits in with the general field of query expansion. In the query expansion literature, a users search term is expanded and scored relative to its relationship to the original term (via co-occurrence metrics). What I like about this is instead of using word2vec to churn out candidate synonyms, with a lot of data cleaning, this does the opposite. It starts with hypothesized synonyms to seed an exploration, then approximating how they relate via co-occurrence metrics. Taxonomies

As mentioned in the prior section, search teams will say they want the synonyms ‘closest’ to the user’s query scored higher than the synonyms ‘farthest’ away. A search for ‘dress trousers’ should return literal ‘dress trousers’ matches. But after that, types of dress trousers should be returned (khakis, slacks) followed perhaps by other kinds of ‘trousers’.

In addition to hypernymy/hyponomy – which relate terms by broader and narrower meaning – you often see meronomy (aka partonomy). In these cases, the hierarchy expresses ‘part of’ relationships. A feather is a part of a wing which is part of a bird. electricity bill calculator A dog is part of a pack. There’s an important distinction here, as meronomy doesn’t relate to meaning. But people do place meronyms in hierarchical taxonomies. youtube gas laws Expected Search Engine Scoring of Taxonomies – Nym Ordering

As a relevance engineer, you can decide how deep to get in the nym expansion (go all the way to great-grandparent hypernyms?) Indeed, a lot boils down to how the taxonomy is designed and/or generated – which itself is a skill and a discipline with its own conferences and specialists. I often say, for most search teams, a good taxonomy will have a much bigger impact than Learning to Rank ever will.

What if we want to move past just hierarchies – to allowing search system to reason about the search query and solve user’s problems? Dogs eat alpo and sometimes sadly squirrels and shoes. These facts might matter if we’re building a search system to help solve common pet problems expressed in the search query “help! My dog swallowed a shoe!”

Knowledge graphs are a set of facts. Facts as a graph of relationships between connected concepts, with each node a noun; each connecting edge a relationship (often a verb). For example, we know the fact ‘dogs bite cats’. We can represent that as a list of connections between concepts. Such a triple here would connect concepts ‘dog’ and ‘cat’ with the relationship ‘bites’ as below:

We might also have systems that can infer facts for our knowledge graph from our corpus, and bring that into search. Check out Max Irwin’s Haystack Talk on algorithms that allow that. gas in oil pan Now that Max is an OSC employee, perhaps he’ll write a blog post in this series touching on ontologies! A note on Solr’s Knowledge Graph and Elasitcsearch’s Graph Plugin

Just a quick point to clarify confusion. Some have been confused by Elasticsearch’s and Solr’s graph inference capabilities. These aren’t the same as a formal knowledge graph that we describe here (they lack verbs). Instead these are really interesting in their own right: learning what terms are connected based on co-ocurrences. YOU must decide what those co-occurences mean in your data/for your domain and use case. Expected Search Engine Scoring of Knowledge Graphs

These use cases are less about scoring individual documents, but correctly mapping the query into a graph query that can pull back a set of facts from the graph database. This is a slightly different, though very related, problem to search relevance ranking. We’ll discuss this in a future blog post. Shared Contexts (ie embeddings, word2vec, and the like…)

In the first example, occuring in the same container (doc/sentence/etc) acts as a shared context between ‘cat’ and ‘dog’. In the second example a term (‘vet’) acts as a shared context between cat and dog. If we notice that enough similar words group around ‘cat’ and ‘dog’, we make conclusions that cat and dog are ‘close’ in our corpus: they often share contexts.

Tools like word2vec, Latent Dirichlet Allocation, and Latent Semantic Analysis can be compute a per-word representation that encode the contexts that word occurs in. This representation, a dense vector, is built so that words with similar contexts cluster closely together (their vectors are geometrically close). Words with dissimilar contexts tend to be farther apart. electricity experiments When we represent an item as a dense vector, with the expectation it will cluster together with ‘similar’ things we refer to that vector as an embedding.

• Statements of Fact: Terms share contexts because they relate to each other through factual statements (such as described above using a knowledge graph). For example “Dog bites cat” or “Shark bites dog”. Here dog connects the term (a verb) ‘bites’ to ‘shark’. This also might occur a lot with adjectives – ‘red shoe’ and ‘red sky’. Sharing ‘red’ doesn’t mean ‘sky’ and ‘shoe’ are related terms

Using an embedding to emulate any of the use cases we described in this article is possible, but requires work. Indeed it’s very much an open research area! Work we won’t go into here. Many strategies are discussed in the upcoming book Deep Learning for Search by Tommaso Teofili and my Neural Search Frontier talk. And also Simon Hughes talk on Vectors in Search for how to use an embedding in a search engine like Solr. In conclusion… no cut and dry solutions

OK OK if I must. In many (most?) cases having a good taxonomy is often the best assets a search team can have. More important even than jumping on the Learning to Rank bandwagon. gas x coupon 2015 Not everyone needs a full knowledge graph, and usually synonyms don’t come with enough richness to express what search teams intend (many people’s ‘synonyms’ are really badly structured taxonomies). Even e-commerce search is about knowledge management! Is there much difference between an attentive, expert sales person and a librarian after all? So whether you intend to hand-craft a taxonomy or learn a taxonomy from your corpus or user queries, ‘Get thee to a Taxonomy!’ as The Bard says.