Query Analysis & Clustering
Discover hidden patterns in user search behavior by automatically grouping similar queries.
From Raw Queries to Actionable Insights
A raw query log is full of variations, typos, and different phrasing for the same intent. Query clustering cuts through the noise by grouping queries that are spelled almost the same way, such as "i phone" and "iphone", or "mangetout" and "mange tout". Similarity is measured on the characters of each query, so the feature targets spelling and spacing variants rather than synonyms.
The Clustering Pipeline
Bigram Transformation
Each query is broken down into a set of bigrams (overlapping pairs of characters). For example, “apple” becomes {“ap”, “pp”, “pl”, “le”}. Because a typo or a stray space changes only a few bigrams, a misspelled query still shares most of its bigrams with the intended one.
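A minimal sketch of the bigram step, for illustration only. The function names char_bigrams and jaccard are hypothetical, and lowercasing and trimming the query first is an assumption, not something the pipeline is documented to do.

```python
def char_bigrams(query: str) -> set[str]:
    """Return the set of overlapping two-character substrings of a query."""
    # Assumption: normalize case and surrounding whitespace before splitting.
    text = query.lower().strip()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    print(char_bigrams("apple"))
    print(jaccard(char_bigrams("iphone"), char_bigrams("i phone")))
```

With this sketch, "iphone" and "i phone" share four of their seven distinct bigrams, so the extra space costs relatively little overlap.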
MinHash Fingerprinting
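A compact "fingerprint" is calculated for each set of bigrams using the MinHash algorithm. Strings with similar bigram sets produce similar fingerprints: the fraction of positions at which two fingerprints agree estimates the Jaccard similarity of the underlying sets.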
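The idea behind MinHash can be sketched as follows. This is an illustration under assumptions, not the feature's actual implementation: the signature length of 64, the use of salted BLAKE2b from the standard library as the hash family, and the names minhash_signature and estimated_jaccard are all chosen here for the example.

```python
import hashlib

NUM_HASHES = 64  # assumed signature length; longer signatures give better estimates


def _seeded_hash(token: str, seed: int) -> int:
    """Deterministic 64-bit hash of a token; each seed acts as a separate hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens: set[str], num_hashes: int = NUM_HASHES) -> list[int]:
    """For each hash function, keep the smallest hash value seen over the set."""
    if not tokens:
        raise ValueError("cannot fingerprint an empty token set")
    return [min(_seeded_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    apple = minhash_signature({"ap", "pp", "pl", "le"})
    apples = minhash_signature({"ap", "pp", "pl", "le", "es"})
    print(estimated_jaccard(apple, apples))
```

The estimate is noisy for short signatures, which is one reason a precise comparison is still useful afterwards.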
Locality-Sensitive Hashing (LSH)
To avoid comparing every query to every other query, we use LSH. This technique places similar fingerprints into the same "buckets" with high probability, so only queries that share a bucket need to be compared directly. In effect, it is an approximate nearest-neighbor search tailored to set similarity, trading a small chance of missed matches for a large reduction in comparisons.
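One common way to bucket fingerprints is the banding scheme, sketched below. Whether the feature uses banding, and with what parameters, is not stated here; the function name lsh_candidate_pairs and the bands and rows parameters are assumptions for illustration. Fingerprints are assumed to be equal-length lists of integers.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(
    signatures: dict[str, list[int]], bands: int, rows: int
) -> set[tuple[str, str]]:
    """Return pairs of keys whose signatures are identical in at least one band."""
    buckets: dict[tuple, list[str]] = defaultdict(list)
    for key, sig in signatures.items():
        if len(sig) != bands * rows:
            raise ValueError("signature length must equal bands * rows")
        for band in range(bands):
            # Each band is a slice of the signature; identical slices share a bucket.
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets[(band, chunk)].append(key)

    pairs: set[tuple[str, str]] = set()
    for members in buckets.values():
        # Only items that landed in the same bucket become candidates.
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


if __name__ == "__main__":
    toy = {"a": [1, 2, 3, 4], "b": [1, 2, 9, 9], "c": [7, 7, 7, 7]}
    print(lsh_candidate_pairs(toy, bands=2, rows=2))
```

With b bands of r rows each, two items whose true Jaccard similarity is s become candidates with probability 1 - (1 - s^r)^b, so the choice of bands and rows sets how similar two queries must be before they are likely to share a bucket.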
Final Clustering
The system then runs a more precise (but slower) fuzzy matching comparison only on the small groups of candidates identified by LSH. The result is a clean set of clusters, each containing similar queries.
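The final step might look like the sketch below. The fuzzy matcher actually used is not specified, so difflib.SequenceMatcher from the Python standard library stands in for it here; the 0.8 threshold, the function name cluster_queries, and the use of union-find to merge matched pairs into clusters are likewise assumptions.

```python
from difflib import SequenceMatcher


def cluster_queries(
    queries: list[str],
    candidate_pairs: list[tuple[str, str]],
    threshold: float = 0.8,  # assumed cut-off for treating two queries as the same
) -> list[list[str]]:
    """Verify candidate pairs with a precise comparison and merge the matches."""
    parent = {q: q for q in queries}

    def find(q: str) -> str:
        # Follow parent links to the cluster representative, halving paths as we go.
        while parent[q] != q:
            parent[q] = parent[parent[q]]
            q = parent[q]
        return q

    for a, b in candidate_pairs:
        # The slower, more precise check runs only on pairs proposed by LSH.
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            parent[find(a)] = find(b)

    clusters: dict[str, list[str]] = {}
    for q in queries:
        clusters.setdefault(find(q), []).append(q)
    return list(clusters.values())


if __name__ == "__main__":
    queries = ["iphone", "i phone", "mangetout", "mange tout", "laptop"]
    pairs = [("iphone", "i phone"), ("mangetout", "mange tout"), ("iphone", "laptop")]
    for cluster in cluster_queries(queries, pairs):
        print(cluster)
```

Merging with union-find means that if A matches B and B matches C, all three end up in one cluster even when A and C were never compared directly. Whether that transitive behavior is desirable depends on how strict the clusters need to be.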