A look at search engines with their own indexes


This is a cursory review of all the indexing search engines I have been able to find.

The three dominant English search engines with their own indexes (note 1) are Google, Bing, and Yandex (GBY). Many alternatives to GBY exist, but almost none of them have their own results; instead, they just source their results from GBY.

With that in mind, I decided to test and catalog all the different indexing search engines I could find. I prioritized breadth over depth, and encourage readers to try the engines out themselves if they’d like more information.

This page is a “living document” that I plan on updating indefinitely. Check for updates once in a while if you find this page interesting. Feel free to send me suggestions, updates, and corrections; I’d especially appreciate help from those who speak languages besides English and can evaluate a non-English indexing search engine. Contact info is in the article footer.

I plan on updating the engines in the top two categories with more info comparing the structured/linked data the engines leverage (RDFa vocabularies, microdata, microformats, JSON-LD, etc.) to help authors determine which formats to use.

I discuss my motivation for making this page in the Rationale section.

I mostly evaluated English-language search engines because English is my primary language. With some difficulty, I could probably evaluate a Spanish one; however, I wasn’t able to find many Spanish-language engines powered by their own crawlers.

Where I can, I mention details like “allows site submissions” and structured data support. I do this only to inform authors about their options, not as points in engines’ favor.

See the Methodology section at the bottom to learn how I evaluated each one.

These are large engines that pass all my standard tests and more.

Google, Bing, and Yandex support structured data such as microformats1, microdata, RDFa, Open Graph markup, and JSON-LD. Yandex’s support for microformats1 is limited; for instance, it can parse metadata for organizations but not people. Open Graph and Schema.org are the only supported vocabularies I’m aware of. Mojeek is evaluating structured data; it’s interested in Open Graph and Schema.org vocabularies.
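For authors weighing these formats, here is a minimal sketch of what a JSON-LD block using the Schema.org vocabulary can look like, generated with Python's standard `json` module. The function name `article_jsonld`, the chosen fields, and the sample values are assumptions for illustration only; they are not taken from any engine's documentation, and which properties a given engine actually reads will vary.

```python
import json


def article_jsonld(headline: str, author_name: str, url: str) -> str:
    """Build a <script> element holding Schema.org Article metadata as JSON-LD.

    Hypothetical helper: the selection of fields is illustrative, not a
    list of what any particular search engine requires.
    """
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author_name},
        "url": url,
    }
    # Escape "</" so a value can never close the surrounding <script> tag;
    # "<\/" is still a valid JSON escape for "</".
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Sample values are placeholders.
snippet = article_jsonld("Example post", "Jane Doe", "https://example.com/post")
print(snippet)
```

The same metadata could be expressed as microdata or RDFa attributes on the visible HTML instead; JSON-LD simply keeps it in one separate block.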

These engines pass most of the tests listed in the “methodology” section. All of them seem relatively privacy-friendly.

Yep supports Open Graph and some JSON-LD at the moment. A look through the source code for Alexandria and Gigablast didn’t seem to reveal the use of any structured data.

These engines fail badly at a few important tests. Otherwise, they seem to work well enough.

seekport: The interface is in German but it supports searching in English just fine. The default language is selected by your locale. It’s really good considering its small index; it hasn’t heard of less common terms (e.g. “Seirdy”), but it’s able to find relevant results in other tests.

Exalead: slow, quality is hit-and-miss. Its indexer claims to crawl the DMOZ directory, which has since shut down and been replaced by the Curlie directory. No relevant results for “Oppenheimer” and some other history-related queries. Allows submitting individual URLs for indexing, but requires solving a Google reCAPTCHA and entering an email address.

ExactSeek: small index, disproportionately dominated by big sites. Failed multiple tests. Allows submitting individual URLs for crawling, but requires entering an email address and receiving a newsletter. Webmaster tools seem to heavily push for paid options. It also powers SitesOnDisplay and Blog-search.com.

Infotiger: A small index that seems to find relevant results. It allows site submission for English and German pages. It also features a “similarity” search to query pages similar to a given link, with mixed results.

Burf.co: Very small index, but seems fine at ranking more relevant results higher. Allows site submission without any extra steps.

Entfer: a newcomer that lets registered users upvote/downvote search results to customize ranking. Doesn’t offer much information about who made it. Its index is small, but it does seem to return results related to the query.

Siik: Lacks contact info, and the ToS and Privacy Policy links are dead. Seems to have PHP errors in the backend for some of its instant-answer widgets. If you scroll past all that, it does have web results powered by what seems to be its own index. These results do tend to be somewhat relevant, but the index seems too small for more specific queries.

websearchengine.org and tuxdex.com: Both are run by the same people, powered by their inetdex.com index. Searches are fast, but crawls are a bit shallow. Claims to have an index of 10 million domains, and not to use cookies.

Meorca: A UK-based search engine that claims not to “index pornography or illegal content websites”. It also features an optional social network (“blog”). Discovered in the seirdy.one access logs.

ChatNoir: An experimental engine by researchers that uses the Common Crawl index. The engine is open source. See the announcement on the Common Crawl mailing list (Google Groups).

Secret Search Engine Labs: Very small index with very little SEO spam; it toes the line between a “search engine” and a “surf engine”. It’s best for reading about broad topics that would otherwise be dominated by SEO spam, thanks to its CashRank algorithm. Allows site submission.

Results from these search engines don’t seem at all useful.

Engines in this category fall back to GBY when their own indexes don’t have enough results. As their own indexes grow, some claim that this should happen less often.

Brave Search: Many tests (including all the tests I listed in the “Methodology” section) returned results identical to Google, as revealed by a side-by-side comparison with Google, Startpage, and a Searx instance with only Google enabled. Brave claims that this is due to how Cliqz (the discontinued engine acquired by Brave) used query logs to build its page models and was optimized to match Google (note 7). The index is independent, but optimizing against Google resulted in too much similarity for the real benefit of an independent index to show. Furthermore, many queries have Bing results mixed in; users can click an “info” button to see the percentage of results that came from its own index. The independent percentage is typically quite high (often close to 100% independent) but can drop for advanced queries.

Plumb: Almost all queries return no results; when this happens, it falls back to Google. It’s fairly transparent about the fallback process, but I’m concerned about how it does this: it loads Google’s Custom Search scripts onto the page to do a client-side Google search. This can be mitigated by using a browser addon to block those scripts from loading. Plumb claims that this is a temporary measure while its index grows, and that it plans to get rid of it. Allows submitting URLs, but requires solving an hCaptcha. This engine is very new; hopefully, as it improves, it can graduate from this section. Its Chief Product Officer previously founded the Gibiru search engine, which shares the same affiliates and (for now) the same index; the indexes will diverge with time.

Neeva: Combines Bing results with results from its own index. Bing normally isn’t okay with this, but Neeva is one of few exceptions. As of right now, results are mostly identical to Bing, but original links not found by Bing frequently pop up. Long and esoteric queries are less likely to feature original results. Requires signing up with an email address or OAuth to use, and offers a paid tier with additional benefits.

Qwant: Qwant claims to use its own index, but it still relies on Bing for most results. It seems to be in a position similar to Neeva. Try a side-by-side comparison to see if or how it compares with Bing.

Kagi Search: The most interesting entry in this category, IMO. Like Neeva, it requires an account; it will eventually require payment. It’s powered by its own Teclis index (Teclis can be used independently; see the non-commercial section below), and claims to also use results from Google and Bing. The results seem somewhat unique: I’m able to recognize some results from the Teclis index mixed in with the mainstream ones. In addition to Teclis, Kagi’s other products include the Kagi.ai intelligent answer service and the TinyGem social bookmarking service, both of which play a role in Kagi.com in the present or future.

These indexing search engines don’t have a Google-like “ask me anything” endgame; they’re trying to do something different. You aren’t supposed to use these engines the same way you use GBY.

These engines try to find a website, typically at the domain-name level. They don’t focus on capturing particular pages within websites.

Kozmonavt: The best in this category. Has a small but growing index of over 8 million sites. If I want to find the website for a certain project, Kozmonavt works well (provided its index has crawled said website). It works poorly for learning things and finding general information. I cannot recommend it for anything serious since it lacks contact information, a privacy policy, or any other information about the org/people who made it. Discovered in the seirdy.one access logs.

search.tl: Generalist search for one TLD at a time (defaults to .com). I’m not sure why you’d want to always limit your searches to a single TLD, but now you can (note 8). There isn’t any visible UI for changing the TLD for available results; you need to add or change a URL parameter by hand (for example, to search .org sites). It seems to be connected to Amidalla. Amidalla allows users to manually add URLs to its index and directory; I have yet to see if doing so impacts search.tl results.

Thunderstone: A combined website catalog and search engine that focuses on categorization. Its about page claims: “We continuously survey all primary COM, NET, and ORG web-servers and distill their contents to produce this database. This is an index of sites not pages. It is very good at finding companies and organizations by purpose, product, subject matter, or location. If you’re trying to finding things like ‘BillyBob’s personal beer can page on AOL’, try Yahoo or Dogpile.” This seems to be the polar opposite of the engines in the “small or non-commercial Web” category.

sengine.info: only shows domains, not individual pages. Developed by netEstate GmbH, which specializes in content extraction for imprints and job ads. Also has a German-only version available. Discovered in my access logs.

Gnomit: Allows single-keyword queries and returns sites that seem to cover a related topic. I actually kind of enjoy using it; results are old (typically from 2009) and a bit random, but make for a nice way to discover something new. For instance, searching for “IRC” helped me discover new IRC networks I’d never heard of.

I’m unable to evaluate these engines properly since I don’t speak the necessary languages. English searches on these are hit-or-miss. I might have made a few mistakes in this category.

Some search engines are integrated into other appliances, but don’t have a web portal.

These engines were originally included in the article, but have since been discontinued.

Two engines were excluded from this list for having a far-right focus.

One engine was excluded because it seems to be built using cryptocurrency in a way I’d rather not support.

Some fascinating little engines seem like hobbyist proofs-of-concept. I decided not to include them in this list, but watch them with interest to see if they can become something viable.

Google, Microsoft (the company behind Bing), and Yandex aren’t just search engine companies; they’re content and ad companies as well. For example, Google hosts video content on YouTube and Microsoft hosts social media content on LinkedIn. This gives these companies a powerful incentive to prioritize their own content. They are able to do so even if they claim that they treat their own content the same as any other: since they have complete access to their search engines’ inner workings, they can tailor their content pages to better fit their algorithms and tailor their algorithms to work well on their own content. They can also index their own content without limitations but throttle indexing for other crawlers (note 9).

One way to avoid this conflict of interest is to use search engines that aren’t linked to major content providers; i.e., use engines with their own independent indexes.

There’s also a practical, less-ideological reason to try other engines: different providers have different results. Websites that are hard to find on one search engine might be easy to find on another, so using more indexes and ranking algorithms results in access to more content.

No search engine is truly unbiased. Most engines’ ranking algorithms incorporate a method similar to PageRank, which biases them towards sites with many backlinks. Search engines have to deal with unwanted results occupying the confusing overlap between SEO spam, shock content, and duplicate content. When this content’s manipulation of ranking algos causes it to rank high, engines have to address it through manual action or algorithm refinement. Choosing to address it through either option, or choosing to leave it there for popular queries after receiving user reports, reflects bias. The best solution is to mix different ranking algorithms and indexes instead of using one engine for everything.
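To make the backlink bias concrete, here is a toy power-iteration sketch in the spirit of PageRank. It is a simplified textbook version, not any engine's actual ranking code; the function name, the damping factor of 0.85, the iteration count, and the five-page link graph are all assumptions chosen for illustration.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Toy PageRank: repeatedly pass each page's score along its outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: "big" has three backlinks, "small" has only one.
graph = {
    "a": ["big", "small"],
    "b": ["big"],
    "c": ["big"],
    "big": ["a"],
    "small": [],
}
for page, score in sorted(pagerank(graph).items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this made-up graph, "big" ends up scoring higher than "small" purely because more pages link to it; nothing about the quality of either page enters the calculation.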

Engines in this list should have their own indexes built primarily by web spiders. They should not be limited to a set of domains hand-picked by the engine creators.

I’m willing to make one exception: engines in the “non-generalist” section may use indexes primarily made of user-submitted sites, rather than focusing primarily on sites discovered organically through crawling. I’m not willing to budge on the “no hand-picked domains” rule.

I only consider search engines that focus on link results for webpages. Image search engines are out of scope, though I might consider some other engines for non-generalist search (e.g., Semantic Scholar finds PDFs rather than webpages).

I focused almost entirely on “organic results” (the classic link results), and didn’t focus too much on (often glaring) privacy issues, “enhanced” or “instant” results (e.g. Wikipedia sidebars, related searches, Stack Exchange answers), or other elements.

I compared results for esoteric queries side-by-side; if the first 20 results were (nearly) identical to another engine’s results (though perhaps in a slightly different order), they were likely sourced externally and not from an independent index.
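The side-by-side comparison can be roughly approximated in code. The sketch below assumes you have already collected each engine's result URLs by hand; the helper names, the URL normalization rules, and the sample URLs are hypothetical, and no particular overlap value is claimed here as the cutoff for "sourced externally".

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce a URL to host + path so trivial differences don't hide a match."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower().removeprefix("www.")
    path = parts.path.rstrip("/")
    return host + path


def overlap_ratio(results_a: list[str], results_b: list[str],
                  top_n: int = 20) -> float:
    """Fraction of the top_n results shared by both engines, ignoring order."""
    set_a = {normalize(url) for url in results_a[:top_n]}
    set_b = {normalize(url) for url in results_b[:top_n]}
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))


# Placeholder result lists; real ones would come from manual searches.
engine_a = ["https://example.com/a", "https://www.example.org/b/", "http://example.net/c"]
engine_b = ["http://example.org/b", "https://example.com/a/", "https://example.edu/d"]
print(f"{overlap_ratio(engine_a, engine_b):.2f}")
```

Ignoring order matches the observation that borrowed results may be slightly reshuffled; a ratio near 1.0 across several esoteric queries is the pattern that suggests a shared source.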

I tried to pick queries that should have a good number of results and show variance between search engines. An incomplete selection of queries I tested:

Some less-mainstream engines have noticed this article, which is great! I’ve had excellent discussions with people who work on several of these engines. Unfortunately, this article’s visibility also incentivizes some engines to optimize specifically for any methodology I describe. I’ve addressed this by keeping a long list of test queries to myself. The simple queries above are a decent starting point for simple quick evaluations, but I also test for common search operators, keyword length, and types of domain-specific jargon. I also use queries designed to pull up specific pages with varying levels of popularity and recency to gauge the size, scope, and growth of an index.

Professional critics often work anonymously because personalization can damage the integrity of their reviews. For similar reasons, I attempt to try each engine anonymously at least once by using a VPN and/or my standard anonymous setup: an amnesiac Whonix VM with the Tor Browser. I also often test using a fresh profile when travelling, or via a Searx instance if it supports a given engine. When avoiding personalization, I use “varied” queries that I don’t repeat verbatim across search engines; this reduces the likelihood of identifying me. I also attempt to spread these tests out over time so admins won’t notice an unusual uptick in unpredictable and esoteric searches. This might seem overkill, but I already regularly employ similar methods for a variety of different scenarios.

I didn’t try to avoid personalization when testing engines that require account creation. Entries in the “hit-and-miss” and “unusable” sections got less attention: I didn’t spend a lot of effort tracking results over time to see how new entries got added to them.

I avoided “natural language” queries like questions, focusing instead on keyword searches and search operators. I also mostly ignored infoboxes (also known as “instant answers”).

What I learned by building this list has profoundly changed how I surf.

Using one engine for everything ignores the fact that different engines have different strengths. For example: while Google is focused on being an “answer engine”, other engines are better than Google at discovering new websites related to a broad topic. Fortunately, browsers like Chromium and Firefox make it easy to add many search engine shortcuts for easy switching.

When talking to search engine founders, I found that the biggest obstacle to growing an index is getting blocked by sites. Cloudflare is one of the worst offenders. Too many sites block perfectly well-behaved crawlers, only allowing major players like Googlebot, BingBot, and TwitterBot; this cements the current duopoly over English search and is harmful to the health of the Web as a whole.
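One common form of this blocking is a robots.txt allowlist. The sketch below uses Python's standard `urllib.robotparser` to show how such a policy treats different crawlers; the robots.txt contents and the `ExampleIndieBot` user agent are made up for illustration, and blocking at the CDN or firewall level works differently and is not modeled here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: welcome two major crawlers, turn everyone else away.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

User-agent: *
Disallow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Report which user agents the given robots.txt lets fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# "ExampleIndieBot" stands in for a well-behaved crawler from a small engine.
result = allowed_agents(
    ROBOTS_TXT,
    ["Googlebot", "bingbot", "ExampleIndieBot"],
    "https://example.com/page",
)
for agent, allowed in result.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

A crawler that honors robots.txt has no way around a policy like this, so a new engine's index simply never sees the site.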

Too many people optimize sites specifically for Google without considering the long-term consequences of their actions. One of many examples is how Google’s JavaScript support rendered the practice of testing a website without JavaScript or images “obsolete”: almost no non-GBY engines on this list are JavaScript-aware.

When building webpages, authors need to consider the barriers to entry for a new search engine. The best engines we can build today shouldn’t replace Google. They should try to be different. We want to see the Web that Google won’t show us, and search engine diversity is an important step in that direction.

Try a “bad” engine from lower in the list. It might show you utter crap. But every garbage heap has an undiscovered treasure. I’m sure that some hidden gems you’ll find will be worth your while. Let’s add some serendipity to the SEO-filled Web.

Some of this content came from the Search Engine Map and Search Engine Party. A few web directories also proved useful.

also gave me some helpful information about GBY which I included in the “Rationale” section. He’s written more about big tech in the Gigablast blog.

by is a great post on alternative search engines. He also gave me some useful details about Seznam, Naver, Baidu, and Goo.

Source: Seirdy.one
