Dev:API
Introduction
Besides the web interface, YaCy offers a rich XML- and JSON-based API for interaction. Some of these interfaces can also be accessed as HTML, and these pages are integrated into the YaCy web interface. When you access such a page, an 'API' tooltip icon appears in the upper right corner of the web page, and hovering over it shows a short introduction to the API. The API icon itself links to the XML, JSON or similar API file that presents the displayed data in annotated form. Please note that the tooltip and the underlying link to the API path change every time you navigate to another YaCy page: even if the icon looks the same, it always links to the data you currently see on the web page.
API reference
| |
| peer access statistics | |
| YaCy blog | |
| show and edit crawl profiles | |
| peer and network statistics | |
| peer memory status | |
| peer status of busy queues | |
| single url crawl start with immediate confirmation | |
| peer steering: shutdown, restart, pause/resume crawls | |
| update peer | |
| view peer profile | |
| YaCy search returning xml or json results | |
| |
| |
| : | |
| : | |
| YaCy SVN version | |
| |
| crawling information for single url | |
| |
| |
| |
| |
(Note: up to YaCy 0.7, all paths that now begin with /api/ were located at /xml/. From 0.71 onward, all API paths start with /api/ as listed here.)
Understanding the YaCy data format
As with all REST-based services, things start with an HTTP request to a YaCy applet. The request contains a query with one or more input parameters, and the server replies with an Atom, RSS, JSON or HTML-formatted response, suitable for parsing in any XML/JSON-aware client. Depending on the applet called, the result is delivered as XML or JSON. For example, the REST method http://localhost:8080/Network.xml returns a collection of network information regarding the queried peer. If called with the optional parameter ?page=1, it returns a list of all active peers. The raw XML response to this method (which you can view in the source code of the resulting page) contains detailed information on these peers, and might look something like this:
<?xml version="1.0" ?>
<peers>
  <active>
    <count>72</count>
    <links>719118988</links>
    <words>61717251</words>
  </active>
  <passive>
    <count>266</count>
    <links>917500019</links>
    <words>276274016</words>
  </passive>
  <potential>
    <count>445</count>
    <links>365033572</links>
    <words>92047543</words>
  </potential>
  <all>
    <count>783</count>
    <links>2001652579</links>
    <words>430038810</words>
  </all>
  <your>
    <name>dlc-am2</name>
    <hash></hash>
    <type>senior</type>
    <version>0.93006636</version>
    <utc>+0100</utc>
    <uptime>0 days 04:19</uptime>
    <links>82829676</links>
    <words>3481412</words>
    <rurls>0</rurls>
    <acceptcrawl>1</acceptcrawl>
    <acceptindex>1</acceptindex>
    <acceptranking>1</acceptranking>
    <sentwords>176287214</sentwords>
    <senturls>420220612</senturls>
    <receivedwords>629126201</receivedwords>
    <receivedurls>66745289</receivedurls>
    <ppm>57</ppm>
    <qph>0.05</qph>
    <seeds>71</seeds>
    <connects>3</connects>
    <location>Europe/de</location>
    <seedurl />
  </your>
  <cluster>
    <ppm>-1176</ppm>
    <qph>3</qph>
  </cluster>
</peers>
The elements within the <peers> entry contain detailed information about the network and the queried peer; in this example, network 'freeworld' and peer 'dlc-am2'. Not all YaCy functions are publicly accessible in this manner. Certain functions, specifically those that read restricted peer data, modify data on the peer or network, or add or modify crawl jobs, are only accessible with authorisation. To get access to these API functions, use HTTP basic authentication to send your YaCy account information to the queried peer.
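The request and parsing steps described above can be sketched in a few lines. This is a minimal Python sketch, not part of YaCy or its PHP class: the function names, the shortened sample document and the placeholder host and credentials are assumptions for illustration. Only the Network.xml path, the element names and the use of HTTP basic auth come from this page.

```python
import base64
import urllib.request
import xml.etree.ElementTree as ET


def fetch_network_xml(base_url, user=None, password=None):
    """Request Network.xml from a peer, optionally with HTTP basic auth."""
    request = urllib.request.Request(base_url.rstrip("/") + "/Network.xml")
    if user is not None:
        token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
        request.add_header("Authorization", "Basic " + token.decode("ascii"))
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


def summarize_network(xml_text):
    """Pick a few values out of a <peers> document like the one above."""
    root = ET.fromstring(xml_text)
    return {
        "peer_name": root.findtext("your/name"),
        "peer_type": root.findtext("your/type"),
        "active_peers": int(root.findtext("active/count")),
        "all_links": int(root.findtext("all/links")),
    }


# Shortened copy of the sample response, so the parser can be tried offline.
SAMPLE = """<?xml version="1.0" ?>
<peers>
  <active><count>72</count><links>719118988</links></active>
  <all><count>783</count><links>2001652579</links></all>
  <your><name>dlc-am2</name><type>senior</type></your>
</peers>"""

print(summarize_network(SAMPLE))
# Against a running peer (host and credentials are placeholders):
# print(summarize_network(fetch_network_xml("http://localhost:8080", "admin", "secret")))
```

Keeping the fetch and the parse in separate functions makes it possible to test the parsing against a saved response without a running peer.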
Retrieving peer status information
The example above showed how to retrieve information from a peer by simply calling the appropriate applet and decoding the delivered XML. The easiest way to explore other API calls is to perform the desired action in the YaCy admin interface and use the same parameters when calling the RSS or XML applet, e.g. Network.xml instead of Network.html. Most actions that have been issued through the YaCy interface to change the configuration or to request crawl actions can be examined on the page Table_API_p.html.
After the query results have been received, the delivered XML or JSON must be converted into a SimpleXML object or an array. The client then iterates over the elements in the response, processing each one in a foreach() loop and retrieving the information sent by the peer.
Here's how some sample peer information is retrieved using PHP or similar languages for web applications.
Handling XML with PHP5
Open a connection to the desired peer and send an HTTP request. A PHP5 class, Dev:YaCyAPIforPHP, is available for simple handling of requests to one or multiple YaCy peers.
A native HTTP request can be handled with cURL, as shown in this example:
<?php
// method using native php-curl
$YaCyURL="http://mypeer.tld:8080/";
$cu=$YaCyURL."Status.html";
$queryServer = curl_init($cu);
curl_setopt($queryServer, CURLOPT_HEADER, 0);
curl_setopt($queryServer, CURLOPT_RETURNTRANSFER, 1);
// $appID must hold the credentials as "user:password"
curl_setopt($queryServer, CURLOPT_USERPWD,$appID);
$results = curl_exec($queryServer);
curl_close($queryServer);
?>
- The peer's friendly name is stored in the <your> node collection. The sample accesses this collection as $yourpeer and reads values such as the name or hash from $yourpeer['name'] or $yourpeer['hash'].
- The network's URL count is stored in the <links> element of the <all> node collection (its <count> element holds the number of peers). The sample accesses this collection as $allpeers and reads the value from $allpeers['links'].
<?php
//method using YaCyapi.php
require 'YaCyAPI2.php';
// start the class
$search = new YaCyAPI();
$results = $search->peerCommand("Network.xml");
//now we have xml, put it in a simple array
//(xml2array() is not a PHP built-in; it is assumed to come with the included library)
$resultarray=xml2array($results); #convert to php-array
//get items
$yourpeer=$resultarray['peers']['your'];
$peername=$yourpeer['name'];
$peerhash=$yourpeer['hash'];
//
$allpeers=$resultarray['peers']['all'];
$urlcount=$allpeers['links'];
?>
The returned XML string is now converted to an array (xml2array).
The next example calls Network.xml with the page parameter to retrieve information about all peers in the queried network.
<?php
// $search is the YaCyAPI object created in the previous example
$results = $search->peerCommand("Network.xml?page=1");
//now we have xml, put it in a simple array
$resultarray=xml2array($results);
//get items only
$items=$resultarray['peers']['peer'];
//columns to print, in display order
$fields=array('hash','fullname','type','version','ppm','qph','uptime',
    'links','words','rurls','lastseen','sendWords','receivedWords',
    'sendURLs','receivedURLs','direct','acceptcrawl','dhtreceive',
    'rankingreceive','location','seedurl','age','seeds','connects');
if ($items)
{
    echo "<h1>Active Peers</h1>";
    echo "<table>";
    $tr="aaaaaa"; //row colour, toggled before each row
    foreach ($items as $item)
    {
        if ($tr=="ffffff") {$tr="aaaaaa";} else {$tr="ffffff";}
        echo "<tr bgcolor=\"#".$tr."\">";
        foreach ($fields as $field)
        {
            echo "<td>".$item[$field]."</td>";
        }
        echo "</tr>";
    }
    echo "</table>";
}
?>
Handling JSON with PHP5
Some applets can be called to deliver JSON instead of XML. Results are delivered a bit faster, and most parsers decode the returned data more quickly, so this format should be preferred when speed matters.
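Decoding a JSON response needs no helper beyond a standard JSON parser. The following is a minimal sketch in Python rather than PHP. The response layout used here, a top-level "channels" list whose entries carry an "items" list, is an assumption for illustration and is not documented on this page; the sample payload and the function name are likewise made up.

```python
import json

# Assumed layout of a JSON search response (not taken from this page).
SAMPLE = """{
  "channels": [{
    "title": "Search for yacy",
    "items": [
      {"title": "YaCyWeb.de", "link": "http://yacyweb.de/"},
      {"title": "YaCy", "link": "http://yacy.net/"}
    ]
  }]
}"""


def extract_links(json_text):
    """Return the link of every item in every channel of the response."""
    data = json.loads(json_text)
    links = []
    for channel in data.get("channels", []):
        for item in channel.get("items", []):
            links.append(item["link"])
    return links


print(extract_links(SAMPLE))
```

In PHP the equivalent step would be a single json_decode() call followed by the same two nested loops.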
Handling XML or JSON with Ruby on Rails
The Ruby gem [httparty] offers great flexibility when working with REST-based applications. This example uses it to run a quick search for 25 global results for the query 'test' and prints the links found.
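require 'httparty'

class YaCy
  include HTTParty
  format :xml
  base_uri 'http://localhost:8080'

  # run a global search and return the parsed response
  def self.search(q)
    get('/yacysearch.rss', :query => {
      :query => q,
      :resource => 'global',
      :verify => 'false',
      :maximumRecords => '25'
    })
  end
end

begin
  response = YaCy.search('test')
  # the parsed RSS nests the results under rss -> channel -> item;
  # a single result is parsed as a hash, so wrap it in an array
  items = response['rss']['channel']['item']
  items = [items] if items.is_a?(Hash)
  items.each do |item|
    puts item['link']
  end
rescue StandardError
  p 'oops nothing found'
end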
Handling XML or JSON with perl
For Perl, the library [Ismael] is available to handle requests and the returned results.
Performing a search query
Understanding YaCy result data
Calling the applet yacysearch.rss returns an XML-formatted list of search results and some additional information regarding the search, which may look like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl' href='/yacysearch.xsl' version='1.0'?>
<rss version="2.0"
xmlns:yacy="http://www.yacy.net/"
xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/"
xmlns:media="http://search.yahoo.com/mrss/"
xmlns:atom="http://www.w3.org/2005/Atom">
<!-- YaCy Search Engine; http://yacy.net -->
<channel>
<title>dulcedoSearch</title>
<description>Search for yacy</description>
<link>http://localhost:8080/yacysearch.html?query=yacy&resource=global&contentdom=text&verify=true</link>
<image>
<url>http://localhost:8080/env/grafics/yacy.gif</url>
<title>Search for yacy</title>
<link>http://localhost:8080/yacysearch.html?query=yacy&resource=global&contentdom=text&verify=true</link>
</image>
<opensearch:totalResults>145.650</opensearch:totalResults>
<opensearch:startIndex>0</opensearch:startIndex>
<opensearch:itemsPerPage>10</opensearch:itemsPerPage>
<atom:link rel="related" href="opensearchdescription.xml" type="application/opensearchdescription+xml"/>
<opensearch:Query role="request" searchTerms="yacy" />
Search results are stored as <item> child elements of the RSS feed's <channel>:
<item>
  <title>YaCyWeb.de - unzensierte Suchmaschine - uncensored search engine - YaCy (not YaCi)</title>
  <link>http://yacyweb.de/</link>
  <description>YaCyWeb.de - unzensierte Suchmaschine - uncensored search engine - <b>YaCy</b> (not YaCi)</description>
  <pubDate>Wed, 14 Oct 2009 02:00:00 +0200</pubDate>
  <yacy:size>11667</yacy:size>
  <yacy:sizename>11 kbyte</yacy:sizename>
  <yacy:host>yacyweb.de</yacy:host>
  <yacy:path>/</yacy:path>
  <yacy:file>/</yacy:file>
  <guid isPermaLink="false">g_L5h78wvMRA</guid>
</item>
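Because the feed mixes plain RSS elements with elements from the yacy and opensearch namespaces, a namespace-aware parser is needed to reach all fields. The sketch below, in Python, shows one way to do this. The namespace URIs are copied from the sample feed above; the function name and the shortened sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as declared in the <rss> element of the sample feed.
NS = {
    "yacy": "http://www.yacy.net/",
    "opensearch": "http://a9.com/-/spec/opensearch/1.1/",
}

# Shortened feed for offline testing.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
  xmlns:yacy="http://www.yacy.net/"
  xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
<channel>
<opensearch:totalResults>145.650</opensearch:totalResults>
<item>
<title>YaCyWeb.de</title>
<link>http://yacyweb.de/</link>
<yacy:size>11667</yacy:size>
<yacy:host>yacyweb.de</yacy:host>
</item>
</channel>
</rss>"""


def parse_results(rss_text):
    """Return (totalResults text, list of result dicts) from a search feed."""
    channel = ET.fromstring(rss_text).find("channel")
    total = channel.findtext("opensearch:totalResults", namespaces=NS)
    items = []
    for item in channel.findall("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "host": item.findtext("yacy:host", namespaces=NS),
            "size": int(item.findtext("yacy:size", default="0", namespaces=NS)),
        })
    return total, items


total, items = parse_results(SAMPLE)
print(total, items)
```

The total is returned as text because the sample feed formats it with a thousands separator (145.650), which a plain integer conversion would reject.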
Start a YaCy search
A search is started by calling the YaCy applet yacysearch, which can deliver HTML, XML (.rss) or JSON containing information about the found URLs and statistics.
// using YaCyapi.php
$search->query($searchWeb)
    ->setSources('Web')
    ->setFormat('xml')
    ->setOptions('startRecord=21');
$results = $search->search();
or native php:
//open connection to peer
$YaCyURL="http://mypeer.tld:8080/";
$cu=$YaCyURL."yacysearch.rss";
$cu=$cu."?query=yacy";
$cu=$cu."&maximumRecords=10";
$cu=$cu."&startRecord=21";
$queryServer = curl_init($cu);
curl_setopt($queryServer, CURLOPT_HEADER, 0);
curl_setopt($queryServer, CURLOPT_RETURNTRANSFER, 1);
//$appID must hold the credentials as "user:password"
curl_setopt($queryServer, CURLOPT_USERPWD,$appID);
$results = curl_exec($queryServer);
curl_close($queryServer);
and continue with the results:
//now we have xml/json, put it in a simple array
$resultarray=xml2array($results); #, $get_attributes = 1, $priority = 'tag');
//get items
$items=$resultarray['rss']['channel']['item'];
if ($items)
{
    foreach ($items as $item)
    {
        echo "<p><a href=\"".$item['link']."\">".$item['title']."</a>";
        echo "<br>".$item['description']."</p>";
    }
} else {
    echo "no results";
}
Using additional parameters
For a more precise search, or to select result ranges, some of these parameters can be added to a search query; otherwise their default values are used.
| query = | string of space-separated search terms; if the search contains the keyword RECENT, YaCy sorts the search result by date |
| contentdom = text | |
| resource = local | |
| urlmaskfilter = | RegExp to limit the search |
| prefermaskfilter = | RegExp for preferring results after search |
| verify = true | |
| maximumRecords = 10 | number of items YaCy should return; queries without authentication are limited to 10 results |
| startRecord = | number of the first record to return, e.g. for maximumRecords=10 and startRecord=11 YaCy returns results 11-20 |
| lr = | desired language, e.g. lr=lang_en |
| meancount = | maximum number of alternative queries for 'did you mean?' |
| nav = | |
| display = | (obsolete?) |
| Enter = search | |
| constraint = | only index pages: value = AQAAAA |
| former = | |
| count = | deprecated, see maximumRecords |
| offset = | deprecated, see startRecord |
| indexof = | identical to constraint=AQAAAA |
TODO: dev: please compare/correct with Dev:APIyacysearch (also incomplete) to join articles
The native PHP example below shows results 21 to 30 from a query for 'yacy'.
<?php
//open connection to peer
$YaCyURL="http://mypeer.tld:8080/";
$cu=$YaCyURL."yacysearch.rss";
$cu=$cu."?query=yacy";
$cu=$cu."&maximumRecords=10";
$cu=$cu."&startRecord=21";
$queryServer = curl_init($cu);
curl_setopt($queryServer, CURLOPT_HEADER, 0);
curl_setopt($queryServer, CURLOPT_RETURNTRANSFER, 1);
//$appID must hold the credentials as "user:password"
curl_setopt($queryServer, CURLOPT_USERPWD,$appID);
$results = curl_exec($queryServer);
curl_close($queryServer);
//now we have xml/json, put it in a simple array
$resultarray=xml2array($results);
//item childgroup
$items=$resultarray['rss']['channel']['item'];
if ($items)
{
    foreach ($items as $item)
    {
        echo "<a href=\"".$item['link']."\">".$item['title']."</a>";
    }
} else {
    echo "no results";
}
?>
Another example of an enhanced search: it shows 5 results beginning at record 50 for 'yacy', limits them to PNG images, and prefers results from yacy.net. To speed up the search a bit, verify is set to false, and because all peers should be asked, resource is set to global.
http://localhost:8080/yacysearch.rss?query=yacy&contentdom=text&maximumRecords=5&startRecord=50&verify=false&resource=global&urlmaskfilter=png&prefermaskfilter=yacy.net
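Query strings like the one above are easier to maintain when they are assembled from a parameter list instead of being concatenated by hand, because the values are then URL-encoded automatically. A minimal Python sketch follows; the function name is hypothetical, while the parameter names and values are the ones used on this page.

```python
from urllib.parse import urlencode


def build_search_url(base_url, query, **params):
    """Assemble a yacysearch.rss URL; values are URL-encoded automatically."""
    merged = {"query": query}
    merged.update(params)
    return base_url.rstrip("/") + "/yacysearch.rss?" + urlencode(merged)


url = build_search_url(
    "http://localhost:8080",
    "yacy",
    maximumRecords=5,
    startRecord=50,
    verify="false",
    resource="global",
    urlmaskfilter="png",
    prefermaskfilter="yacy.net",
)
print(url)
```

Parameters that are left out simply fall back to the defaults listed in the table above.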
Working with search statistics and navigators
Managing crawl jobs
Understanding YaCy crawl profiles
Each YaCy crawl job has its own profile, which stores the information needed to ensure proper handling of crawled URLs. The profile is created at crawl start, is marked as terminated once the crawl is considered finished, and may also be edited or deleted while the crawl is running. To start a new crawl and create its profile, the following parameters are needed:
| crawlingMode = | |
| crawlingURL = | |
| sitemapURL = | |
| crawlingFile = | |
| crawlingDepth = | This defines how deep the crawler will follow links embedded in websites. A minimum of 0 is recommended and means that the page set as crawling URL, sitemap or file will be added to the index, but no linked content is indexed. 2-4 is good for normal indexing. Be careful with the depth: assuming an average branching factor of 20, a prefetch depth of 8 would index 25,600,000,000 pages, which may be the whole WWW. |
| mustmatch = | The filter is a regular expression that must match the URLs that are to be crawled; the default is 'catch all'. Example: to allow only URLs that contain the word 'science', set the filter to '.*science.*'. An automatic domain restriction can be used to fully crawl a single domain. |
| range = | |
| mustnotmatch = | This filter must not match for a page to be accepted for crawling. The empty string is a never-match filter, which should do well for most cases. |
| crawlingIfOlderCheck = | If this option is used, web pages that already exist in the peer's database are crawled and indexed again. Whether this is done depends on the age of the last crawl: if the last crawl is older than the given date, the page is crawled again; otherwise it is treated as a 'double' and not loaded or indexed again. |
| crawlingIfOlderNumber = | |
| crawlingIfOlderUnit = | |
| crawlingDomFilterCheck = | This option will automatically create a domain filter which limits the crawl to the domains the crawler finds at the given depth. You can use this option, e.g., to crawl a page with bookmarks while restricting the crawl to only those domains that appear on the bookmark page. The adequate depth for this example would be 1. The default value 0 gives no restrictions. |
| crawlingDomFilterDepth = | |
| crawlingDomMaxCheck = | The maximum number of pages that are fetched and indexed from a single domain can be limited with this option. If combined with the 'Auto-Dom-Filter', the limit is applied to all the domains within the given depth. Domains outside the given depth are then sorted out anyway. |
| crawlingDomMaxPages = | |
| crawlingQ = | A question mark is usually a hint for a dynamic page. URLs pointing to dynamic content should usually not be crawled. However, there are sometimes web pages with static content that is accessed with URLs containing question marks. |
| storeHTCache = | This option is used by default for proxy prefetch, but is not needed for explicit crawling. |
| cachePolicy = | The caching policy states when to use the cache during crawling. |
| indexText = | |
| indexMedia = | |
| crawlOrder = | If checked, the crawler will contact other peers and use them as remote indexers for your crawl. If crawling results are needed locally, this switch should be set to false. Only senior and principal peers can initiate or receive remote crawls. A YaCyNews message will be created to inform all peers about a global crawl, so they can omit starting a crawl with the same start point. |
| intention = | |
| xsstopw = | This can be useful to prevent extremely common words from being added to the database, e.g. "the", "he", "she", "it"... To exclude all words given in the file yacy.stopwords from indexing, this has to be set to true. |
| xdstopw = | |
| xpstopw = | |
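The depth warning in the crawlingDepth row can be checked with a little arithmetic. The Python sketch below assumes, as that row does, a uniform branching factor of 20 links per page; real sites vary widely and contain many duplicate links, so these figures are upper bounds rather than predictions. The function names are made up for this illustration.

```python
def pages_at_depth(branching_factor, depth):
    """Pages reached exactly `depth` links away from the start URL."""
    return branching_factor ** depth


def pages_up_to_depth(branching_factor, depth):
    """All pages from the start URL (depth 0) down to `depth`."""
    return sum(branching_factor ** level for level in range(depth + 1))


for depth in (0, 2, 4, 8):
    print(depth, pages_at_depth(20, depth), pages_up_to_depth(20, depth))
```

The figure of 25,600,000,000 quoted in the table is the number of pages at depth 8 alone (20 to the power of 8); the cumulative total is slightly higher.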
Starting new crawls
A crawl is started by calling the crawl start applet (Crawler_p.html in the example below) and submitting the parameters shown above.
http://localhost:8080/Crawler_p.html?crawlingDomMaxPages=10000&range=wide&intention=&sitemapURL=&crawlingQ=on&crawlingMode=url&crawlingURL=http://vip.asus.com/forum/default.aspx%3FSLanguage%3Den-us&crawlingFile=&mustnotmatch=&crawlingFile%24file=&crawlingstart=Neuen%20Crawl%20starten&mustmatch=.*&createBookmark=on&bookmarkFolder=/crawlStart&xsstopw=on&indexMedia=on&crawlingIfOlderUnit=hour&cachePolicy=iffresh&indexText=on&crawlingIfOlderCheck=on&bookmarkTitle=&crawlingDomFilterDepth=1&crawlingDomFilterCheck=on&crawlingIfOlderNumber=1&crawlingDepth=4
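A crawl start URL contains another URL as a parameter value, so correct encoding matters more here than in a search query. The following Python sketch builds such a URL from a small set of the parameters listed above and decodes it again as a check. The function name, the chosen defaults and the start URL are assumptions for illustration, and the value sent for crawlingstart is a placeholder (the URL above sends the label of the submit button there).

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_crawl_start_url(base_url, start_url, depth=2, **extra):
    """Build a Crawler_p.html URL for a single start URL."""
    params = {
        "crawlingMode": "url",
        "crawlingURL": start_url,
        "crawlingDepth": depth,
        "mustmatch": ".*",
        "mustnotmatch": "",
        "indexText": "on",
        # Assumption: the example URL sends the submit button label here;
        # "start" is only a placeholder value.
        "crawlingstart": "start",
    }
    params.update(extra)
    return base_url.rstrip("/") + "/Crawler_p.html?" + urlencode(params)


url = build_crawl_start_url(
    "http://localhost:8080", "http://example.org/?lang=en", depth=1, crawlingQ="on"
)
# Decode the query again to confirm the start URL survived the encoding.
query = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(url)
print(query["crawlingURL"])
```

Since Crawler_p.html is a protected page, the request itself would have to be sent with the admin credentials via HTTP basic auth.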
Retrieving crawl information
Crawl profiles are another useful piece of information provided by YaCy peers: collections of information containing the start URL, crawling depth and filters that specify each running crawl job.
This native PHP example shows how to request a list of all crawl profiles a peer has loaded.
<?php
$command="CrawlProfileEditor_p.xml";
//open connection to peer
$YaCyURL="http://mypeer.tld:8080/";
$cu=$YaCyURL.$command;
$queryServer = curl_init($cu);
curl_setopt($queryServer, CURLOPT_HEADER, 0);
curl_setopt($queryServer, CURLOPT_RETURNTRANSFER, 1);
//$appID must hold the credentials as "user:password"
curl_setopt($queryServer, CURLOPT_USERPWD,$appID);
$results = curl_exec($queryServer);
curl_close($queryServer);
//parse xml...
$resultarray=xml2array($results);
//get items only
$items=$resultarray['crawlProfiles']['crawlProfile'];
//columns to print, in display order
$fields=array('hash','name','status','starturl','depth','mustmatch',
    'mustnotmatch','crawlingIfOlder','crawlingDomFilterDepth',
    'crawlingDomFilterContent','DomMaxPages','withQuery','storeCache',
    'indexText','indexMedia','remoteIndexing');
if ($items)
{
    echo "<h1>Crawl Profiles</h1>";
    echo "<table>";
    $tr="aaaaaa"; //row colour, toggled before each row
    foreach ($items as $item)
    {
        if ($tr=="ffffff") {$tr="aaaaaa";} else {$tr="ffffff";}
        echo "<tr bgcolor=\"#".$tr."\">";
        foreach ($fields as $field)
        {
            echo "<td>".$item[$field]."</td>";
        }
        echo "</tr>";
    }
    echo "</table>";
}
?>
Editing or removing crawl jobs
Steering a peer
To initiate functions without waiting for a delivered result, like pausing/resuming crawls or shutting down the peer, just call the applet as in the admin interface.
http://localhost:8080/Steering.html?restart=
will restart the peer after confirming the admin credentials, if they were not delivered with the query via HTTP basic auth. As the peer neither has to confirm this action nor needs to deliver any data, the client has nothing to parse.
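To avoid the interactive credential prompt, the admin credentials can be attached to the steering request itself. This is a minimal Python sketch of that idea. The function names, host and credentials are placeholders; only the Steering.html?restart= call and the use of HTTP basic auth are taken from this page, and other action names would have to be looked up in the admin interface.

```python
import base64
import urllib.request


def build_steering_request(base_url, action, user, password):
    """Prepare a Steering.html call with HTTP basic auth; nothing is sent yet."""
    url = f"{base_url.rstrip('/')}/Steering.html?{action}="
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


def send_steering_command(request, timeout=10):
    """Send the prepared request; only the HTTP status code is of interest."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


request = build_steering_request("http://localhost:8080/", "restart", "admin", "secret")
print(request.full_url)
# send_steering_command(request)  # would actually restart the peer
```

Building the request separately from sending it makes it possible to inspect or log the call before the peer is affected.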
Resources
As these examples show, the YaCy API is very useful when you want to mash up data found or delivered by YaCy with data from other services, or simply build a customized interface for the YaCy community.
For more information about REST, XML, JSON and implementations in popular web programming languages see also
