YaCy (pronounced "ya see") is a free distributed search engine built on the principles of peer-to-peer (P2P) networks. Its core is a computer program written in Java that runs on several hundred computers, the so-called YaCy-peers. Each YaCy-peer independently crawls the Internet, analyzes and indexes the web pages it finds, and stores the indexing results in a common database (the so-called index), which is shared with other YaCy-peers using P2P principles. It is a free search engine that anyone can use to build a search portal for their intranet or to help search the public Internet.
In contrast to semi-distributed search engines, the YaCy network has a decentralised architecture. All YaCy-peers are equal and no central server exists. YaCy can be run either in a crawling mode or as a local proxy server that indexes the web pages visited by the person running YaCy on their computer (several mechanisms are provided to protect the user's privacy). The search functions are accessed through a locally running web server, which provides a search box for entering search terms and returns results in a format similar to that of other popular search engines.
YaCy is available on Windows, Mac and GNU/Linux.
The YaCy search engine is based on four elements:
- Crawler: a search robot that traverses from web page to web page and analyzes their content.
- Indexer: creates a reverse word index (RWI), i.e. each word in the RWI has its own list of relevant URLs and ranking information. Words are saved in the form of word hashes.
- Search and administration interface: a web interface provided by a local HTTP servlet with a servlet engine.
- Data storage: stores the reverse word index database, utilizing a distributed hash table.
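The reverse word index described above can be illustrated with a minimal Python sketch. It is not YaCy's actual implementation: the `word_hash` function (a truncated SHA-256 digest), the occurrence count used as ranking information, and the example URLs are all assumptions made for illustration. The only points it tries to show are that the index is keyed by word hashes rather than by the words themselves, and that each entry carries a list of URLs with some ranking data.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def word_hash(word: str) -> str:
    """Hypothetical word hash: a short hex digest of the lowercased word.
    YaCy's real hash function is not specified in the text above."""
    return hashlib.sha256(word.lower().encode("utf-8")).hexdigest()[:12]


def build_rwi(pages: Dict[str, str]) -> Dict[str, Dict[str, int]]:
    """Build a reverse word index: word hash -> {url: occurrence count}.
    The occurrence count stands in for 'ranking information'."""
    rwi: Dict[str, Dict[str, int]] = defaultdict(dict)
    for url, text in pages.items():
        for word in text.lower().split():
            h = word_hash(word)
            rwi[h][url] = rwi[h].get(url, 0) + 1
    return dict(rwi)


def lookup(rwi: Dict[str, Dict[str, int]], word: str) -> List[str]:
    """Return URLs for a word, most occurrences first (ties broken by URL)."""
    postings = rwi.get(word_hash(word), {})
    return sorted(postings, key=lambda url: (-postings[url], url))


# Example pages (made up for this sketch).
pages = {
    "http://example.org/a": "free search engine for free content",
    "http://example.org/b": "peer to peer search",
}
rwi = build_rwi(pages)
print(lookup(rwi, "search"))
print(lookup(rwi, "free"))
```

Because only hashes are stored as keys, a peer holding part of the index does not need to hold the plain words for the entries it stores.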
The information society of the 21st century is based on free access to all public information. There is a huge focus on transparency, accountability and accessibility of information. YaCy aims to enable this free access to information effectively and realistically. Whereas the major search engines of global corporations are closed systems whose search technology is neither transparent nor comprehensible, YaCy provides a free, open-source search solution. Everyone can see how information is obtained for the search engine and displayed to the user.
There is a lot of free content on the Internet, such as Wikipedia, free music, and data under Creative Commons and other free-use licenses. This free content should not be discoverable only through proprietary search engines in an increasingly monopolistic Internet infrastructure, because then the monopoly holders decide what information is visible. The YaCy project holds that free information is truly free if it can be accessed using free software, and YaCy supplies the missing link between free information and the user: free search.
A decentralised search engine
The Internet was built on the original philosophy of an all-to-all infrastructure. Lately, however, transmitter-receiver connections have flooded the World Wide Web. Ideally, each consumer of content on the Web should have the same opportunity to produce content as to consume it. YaCy's goal is to help producers and users of information on the Web operate independently of centralised search technology by making all content open to all people.
Benefits of the YaCy philosophy
- Civil rights and privacy
- A central evaluation and monitoring of search queries is impossible.
- Data trails cannot be evaluated. Beyond data protection and privacy, this is also an economic factor with regard to industrial espionage.
- The operation of data centers with enormous power consumption (and sometimes their own power plants) for central web search could be removed. Distributed search requires only the computers of the searchers.
- All searchers have the same rights, for example when adding new content.
- The content of the search engine will be determined by the users, not by commercial aspects of the Web portal operator.
- Individualization of Relevance: everyone can assess the quality and importance of web pages by their own rules and adjust to their personal relevance as a ranking method (both popular and scientific).
- As there is no central server, the results cannot be censored easily, and the reliability is (at least theoretically) higher, because there's no single point of failure and the search index is stored redundantly.
- Because the engine is not owned by a company, there is no centralized advertising.
- Because of its design, YaCy can be used to index intranets or darknets where Internet search engines do not or cannot operate, including Tor, I2P or Freenet.
- It is possible to achieve a high degree of privacy.
- On every search YaCy fetches the pages provided in search results and verifies that they still contain the keywords requested by the user. This ensures that the pages that no longer contain the requested keywords are not displayed to the user, among other things.
- The YaCy protocol uses HTTP requests, which preserves transparency and discoverability while aiding diagnosis and investigation. Performance can be brought closer to that of purpose-built binary protocols (see the Disadvantages section) by using compression such as gzip.
- Built-in support for serving search results via OpenSearch
Disadvantages
- There is no NAT traversal functionality built in.
- As there is no central server and the YaCy network is open to anyone, malicious peers are (theoretically) able to insert inaccurate or commercially biased search results. In theory, no search result displayed to the user can be 'wrong', since all results are, if so configured, verified by downloading each page in the result set to check whether the searched words actually appear on the page at the result URL. However, YaCy identifies itself with a user agent string, so a web server could serve different content to a YaCy crawler than to a normal visitor; this is true of nearly any search engine.
- Result verification is done client-side on every search, which increases network traffic on the computer running YaCy and makes YaCy slower to display the search results than search engines such as Google. This behavior can be disabled, but that would make the search susceptible to search spam.
- The YaCy protocol uses HTTP requests, which can be slower than binary protocols.
- The ranking of sites is done on the YaCy client side (users are encouraged to run their own YaCy server, as using a local server is necessary to gain many of the benefits of YaCy). The ranking algorithms, although easily customized, do not have their workload distributed and are limited to the YaCy word index and whatever analysis can be done on the object being ranked. Therefore, more complex ranking algorithms such as those used by Google (which rank using a variety of contextual factors developed during content crawling) are not yet feasible in YaCy, which limits most users' ability to retrieve more relevant results. However, it is possible to apply crowdsourced ranking to YaCy results using software such as Seeks.
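The client-side result verification mentioned in the lists above can be sketched as follows. This is an illustrative outline under stated assumptions, not YaCy's code: the function names `contains_all_terms` and `verify_results` are hypothetical, and the `fetch` callable is a stand-in for a real HTTP download (here it is backed by a small dictionary so the example runs without network access).

```python
import re
from typing import Callable, List, Optional


def contains_all_terms(page_text: str, terms: List[str]) -> bool:
    """True if every search term occurs as a whole word in the page text."""
    words = set(re.findall(r"\w+", page_text.lower()))
    return all(term.lower() in words for term in terms)


def verify_results(
    urls: List[str],
    terms: List[str],
    fetch: Callable[[str], Optional[str]],
) -> List[str]:
    """Keep only the URLs whose current content still contains all terms.
    `fetch` returns the page text, or None if the page cannot be retrieved."""
    verified = []
    for url in urls:
        text = fetch(url)
        if text is not None and contains_all_terms(text, terms):
            verified.append(url)
    return verified


# Stand-in for live page downloads (assumption: real code would use HTTP).
snapshot = {
    "http://example.org/a": "YaCy is a distributed search engine",
    "http://example.org/b": "This page was rewritten and no longer mentions it",
}
candidates = [
    "http://example.org/a",
    "http://example.org/b",
    "http://example.org/gone",
]
results = verify_results(candidates, ["distributed", "search"], snapshot.get)
print(results)
```

The sketch also shows where the cost comes from: every candidate URL triggers one fetch before anything is displayed, which is the extra network traffic and delay described under the disadvantages.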
YaCy as a search appliance: topic-oriented search and search engine for projects
- You can search for projects (a combination of wikis, forums and websites)
- It is a topic-oriented search engine (combine a search for several web pages from different domains into a single search portal)
- YaCy helps to preserve your anonymity when you search.
- If you run a YaCy peer, you have your own search engine. You can use it either to provide search functionality for your own search portal, or you can join a community of search engine peers to share your web index with the web index of other YaCy peer owners. If you search with YaCy your search requests are anonymous.
Privacy and security
- Your private search requests are never stored, monitored or evaluated for commercial purposes.
- If you are searching for terms related to product development and innovation, you can potentially give away information about your company activities. To maintain your business secrets, you need your own search engine (which can easily be created with YaCy).
- YaCy is a complete search appliance with user interface, index, administration and monitoring.
- YaCy harvests web pages with a web crawler. Documents are then parsed and indexed, and the search index is stored locally. If your peer is part of a peer network, your local search index is also merged into the shared index for that network.
- When a search is started, the local index contributes results together with the global search index from peers in the YaCy search network.
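How local and remote results might be combined can be sketched in a few lines of Python. The text above only states that the local index and the global index from other peers both contribute to a search; the details here are assumptions made for illustration: each result is a hypothetical `(url, score)` pair, duplicates are resolved by keeping the highest score, and the final ordering is computed on the searching peer.

```python
from typing import Dict, List, Tuple

Result = Tuple[str, float]  # (url, score); the score scale is an assumption


def merge_results(
    local: List[Result],
    remote_lists: List[List[Result]],
    limit: int = 10,
) -> List[Result]:
    """Merge local and remote result lists, deduplicate by URL
    (keeping the best score), and return the top `limit` entries."""
    best: Dict[str, float] = {}
    for result_list in [local, *remote_lists]:
        for url, score in result_list:
            if url not in best or score > best[url]:
                best[url] = score
    # Highest score first; ties are broken by URL for a stable order.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Made-up example data: one local list and two lists from remote peers.
local = [("http://example.org/a", 0.9), ("http://example.org/b", 0.4)]
peer_1 = [("http://example.org/b", 0.7), ("http://example.org/c", 0.5)]
peer_2 = [("http://example.org/d", 0.8)]

merged = merge_results(local, [peer_1, peer_2], limit=3)
print(merged)
```

Doing the merge and ranking on the searching peer keeps the final ordering under the user's control, at the price of doing that work on a single machine.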
YaCy platform architecture
YaCy uses a combination of techniques for networking, administration, and maintenance of the search engine's index, including blacklisting, moderation, and communication with the community. Here is how YaCy performs these operations:
- Community components
- Web forum
- XML API
- Web Server
- Crawler with Balancer
- Peer-to-Peer Server Communication
- Content organization
- Blacklisting and Filtering
- Search interface
- Monitoring search results