Web Crawling
Like most Internet search engines, Hailoo is essentially an enormous index of the Internet. In order to create this index, Hailoo sends out "webcrawlers" to explore and download interesting websites. A "webcrawler," also known as a webspider or robot, is a computer program which visits websites and extracts links from each page in order to find more pages to download. Each page is then added to Hailoo's index so that it can potentially appear in a search result.
Unlike the crawlers of other Internet search engines, Hailoo's webcrawlers are designed specifically to seek out Arabic webpages and webpages that may interest Middle Easterners or Arabic-speaking people.
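The link-extraction step that a webcrawler performs can be sketched in a few lines of Python. This is only an illustration of the general technique using the standard library, not Hailoo's actual crawler; the class name LinkExtractor, the function extract_links, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the targets of <a href="..."> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the URL of the page itself.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return every link found in the given HTML, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content, made up for illustration.
page = (
    '<html><body><a href="/about.html">About</a> '
    '<a href="http://www.example.org/">Other</a></body></html>'
)
links = extract_links("http://www.example.com/index.html", page)
print(links)
```

A real crawler would then queue each extracted link for download, which is why pages that other sites link to are easier to discover.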
Your Website
In order for your website to appear in Hailoo search results, our webcrawlers must be able to find it. Our crawlers will have an easier time finding your website if other webpages link to your site. As a webmaster, you can find out if Hailoo is crawling your site by checking your server logs for Hailoo's user-agent:
Mozilla/5.0 (compatible; Hailoobot/X.X; +http://www.hailoo.com/spider.html)
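As a rough sketch of how a webmaster might scan a log for this user-agent, the Python snippet below filters log lines that contain the Hailoobot token. It assumes the server writes the user-agent string into each log line (as in the common "combined" log format); the sample lines, IP addresses, and the version number 1.2 are made up for illustration.

```python
import re

# Matches the crawler's product token with any version number in place of X.X.
HAILOOBOT_RE = re.compile(r"Hailoobot/\d+(?:\.\d+)*", re.IGNORECASE)


def find_hailoobot_hits(log_lines):
    """Return the log lines that contain a Hailoobot user-agent token."""
    return [line for line in log_lines if HAILOOBOT_RE.search(line)]


# Hypothetical access-log lines, invented for this example.
sample_log = [
    '203.0.113.5 - - [01/Jan/2010:10:00:00 +0000] "GET /index.html HTTP/1.1" '
    '200 512 "-" "Mozilla/5.0 (compatible; Hailoobot/1.2; '
    '+http://www.hailoo.com/spider.html)"',
    '198.51.100.7 - - [01/Jan/2010:10:00:05 +0000] "GET /index.html HTTP/1.1" '
    '200 512 "-" "Mozilla/5.0 (Windows NT 6.1) Firefox/3.6"',
]

hits = find_hailoobot_hits(sample_log)
print(len(hits), "request(s) from Hailoobot")
```

To check a real log, the same function can be given an open file object instead of a list, since it only iterates over lines.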
Once our webcrawlers find your site, they are likely to recrawl it every couple of days to look for updates, depending on how frequently its pages change.
The Robots Exclusion Protocol
If you wish to prevent our webcrawlers from downloading and indexing certain webpages on your domain, you must include a robots.txt file in the root directory of your domain. The robots.txt file consists of directives which refer to one or more user-agents (webcrawlers). You can create a directive which applies to a specific webcrawler (such as Hailoo), or you can create a directive using a wildcard character that applies to all webcrawlers.
For example, in order to prevent Hailoo's webcrawlers from downloading http://www.example.com/mypage.html, the robots.txt file at www.example.com should include either of the following directives:
User-Agent: Hailoobot
Disallow: /mypage.html
or:
User-Agent: *
Disallow: /mypage.html
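Before publishing a robots.txt file, it can be useful to check that the rules behave as intended. The sketch below uses Python's standard urllib.robotparser module to test both variants shown above. It illustrates how a standards-following parser reads these rules and is not a statement about Hailoo's own parser; the helper name is_allowed and the crawler name OtherBot are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The two example rule sets from the text above.
specific_rules = "User-Agent: Hailoobot\nDisallow: /mypage.html\n"
wildcard_rules = "User-Agent: *\nDisallow: /mypage.html\n"

blocked_url = "http://www.example.com/mypage.html"
other_url = "http://www.example.com/index.html"

# The Hailoobot-specific rule blocks only Hailoobot, and only the listed page.
print(is_allowed(specific_rules, "Hailoobot", blocked_url))
print(is_allowed(specific_rules, "Hailoobot", other_url))
print(is_allowed(specific_rules, "OtherBot", blocked_url))

# The wildcard rule blocks the page for every crawler.
print(is_allowed(wildcard_rules, "Hailoobot", blocked_url))
print(is_allowed(wildcard_rules, "OtherBot", blocked_url))
```

The comparison shows the practical difference between the two forms: the first restricts only Hailoo's crawler, while the second applies to any crawler that honors the protocol.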
You may also be interested in reading more information about the robots exclusion protocol.