How Love Shanghai's spiders index a site: the search engine's working process revealed

An index can be either a forward index or an inverted index. A forward index maps each web page to the keywords it contains; an inverted index maps each keyword to the web pages that contain it.

information filtering


To avoid crawling the same address twice, the search engine keeps a database that records which addresses have already been crawled and which have not. If you have a new website, you can submit its URL on the official Love Shanghai website; the engine records it and classifies it as an address still to be crawled. The spider then works from this table: it extracts a URL from the database, visits the page, and crawls it.
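As a rough illustration of this bookkeeping, here is a minimal sketch of a table that separates addresses waiting to be crawled from addresses already crawled. It is my own simplification, not the engine's actual code; the class name `UrlFrontier` and its methods are hypothetical.

```python
from collections import deque


class UrlFrontier:
    """Hypothetical record of URLs waiting to be crawled and URLs already crawled."""

    def __init__(self):
        self.to_crawl = deque()  # recorded but not yet crawled, in submission order
        self.crawled = set()     # already visited by the spider
        self._known = set()      # every URL ever recorded, used to reject repeats

    def submit(self, url):
        """Record a URL. Returns False if it was already recorded."""
        if url in self._known:
            return False
        self._known.add(url)
        self.to_crawl.append(url)
        return True

    def next_url(self):
        """Hand the spider the next URL and mark it crawled; None if nothing is left."""
        if not self.to_crawl:
            return None
        url = self.to_crawl.popleft()
        self.crawled.add(url)
        return url
```

Submitting a site's URL corresponds to `submit`, and the spider "extracting a URL from the database" corresponds to `next_url`. A real engine would persist this table and schedule revisits, which the sketch leaves out.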


2, information filtering


4, user search results

When a Love Shanghai spider arrives at a page, it follows the links on that page and crawls from one page to the next, much like a recursive process, working tirelessly year after year. For example, when the spider comes to my blog page 贵族宝贝blog.sina贵族宝贝.cn/net of Shanghai dragon Er, it first reads the robots.txt file in the site's root directory. If crawling is not prohibited there, the spider starts tracking the links on the page and crawls along them. Take my article "overview of what is Shanghai Shanghai dragon | dragon Shanghai dragon in the end is doing": the engine follows the link into the article and grabs its content, and so the cycle repeats without end.
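The behaviour described above (check robots.txt, then follow links from page to page) can be sketched roughly as follows. This is only an illustration under my own assumptions: the pages, the robots.txt rules, and the agent name `ExampleSpider` are made up, and an in-memory dictionary stands in for the real web so the sketch runs without network access. It uses a queue instead of literal recursion so that long link chains cannot overflow the call stack.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "web" and robots.txt, used instead of real HTTP requests.
PAGES = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/private/x.html">X</a>',
    "http://example.com/a.html": '<a href="/">home</a> <a href="/b.html">B</a>',
    "http://example.com/b.html": "no links here",
    "http://example.com/private/x.html": "secret",
}
ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, agent="ExampleSpider"):
    """Return the URLs crawled, in order, skipping anything robots.txt disallows."""
    robots = RobotFileParser()
    robots.parse(ROBOTS_TXT.splitlines())

    seen = {start_url}
    order = []
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if not robots.can_fetch(agent, url):
            continue  # prohibited by robots.txt
        html = PAGES.get(url)
        if html is None:
            continue  # page does not exist in the toy web
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

Calling `crawl("http://example.com/")` visits the home page and the two pages reachable from it, while the page under `/private/` is never fetched because the toy robots.txt disallows it.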

spider crawling



When a user searches for a keyword, the engine matches the query against the index table built earlier and uses the inverted index to find the pages that correspond to the keyword. It then calculates a comprehensive score for each of those pages and ranks them in order of score. How does the engine arrive at this comprehensive score?
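To make the lookup-and-rank step concrete, here is a minimal sketch. The real comprehensive score is not described in this article, so the sketch substitutes a simple stand-in: the number of times the query keywords occur on each page. The sample index, the URLs, and the function name `search` are all hypothetical.

```python
# Hypothetical inverted index: keyword -> {URL: number of occurrences on that page}.
INDEX = {
    "spider": {"http://example.com/a": 3, "http://example.com/b": 1},
    "index": {"http://example.com/b": 2, "http://example.com/c": 1},
}


def search(query, index):
    """Look up each query keyword and rank matching pages by a stand-in score."""
    totals = {}
    for word in query.lower().split():
        for url, count in index.get(word, {}).items():
            # Assumption: the score is just the summed occurrence count.
            totals[url] = totals.get(url, 0) + count
    # Highest score first; ties are broken by URL so the order is stable.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))
```

A query such as `search("spider index", INDEX)` returns the three sample pages ordered by their summed counts, and a keyword that is not in the index returns an empty list.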

A search engine's work can be divided into four processes.

When a spider has crawled a page, the engine first analyzes the page's text content. Using word segmentation technology, it reduces the content of the page to keywords and stores each keyword together with the corresponding URL in an index table.
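A minimal sketch of this step might look like the following. It is an assumption-laden simplification: a regular expression splits English text into words as a stand-in for real word segmentation (Chinese text needs a dedicated segmenter), and the stop-word list and the names `segment` and `build_index` are my own.

```python
import re
from collections import defaultdict

# Illustrative stop words only; a real engine would use a much larger list.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def segment(text):
    """Reduce page text to keywords: lowercase, split into words, drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_index(pages):
    """Build a keyword -> set-of-URLs table from a {url: page_text} mapping."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in segment(text):
            index[word].add(url)
    return index
```

For example, `build_index({"u1": "The spider crawls the page", "u2": "An index of the page"})` records that "page" appears at both URLs and "spider" only at `u1`.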


create page keyword index


Spiders do not index every page they crawl; a page must pass rigorous checks first. When a spider crawls a page and grabs its content, the engine checks to some degree whether the content is copied. If the website's weight is low and most of its articles are copied, the spider will probably not like your website, will stop crawling it, and will not index your site.
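How the engine actually detects copied content is not stated here. One common textbook approach, shown purely as an assumption, is to compare overlapping word sequences (shingles) between two texts and measure their Jaccard similarity. The shingle size, the 0.8 threshold, and the function names below are hypothetical choices.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a, b):
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def looks_copied(candidate, known_texts, threshold=0.8):
    """True if the candidate is at least `threshold` similar to any known text."""
    return any(similarity(candidate, known) >= threshold for known in known_texts)
```

An article that repeats an already indexed text word for word scores 1.0 and is flagged, while an article with no shared three-word sequences scores 0.0.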

A search engine's working process is very complicated. Today I will share with you my understanding of how Love Shanghai's spiders get web pages indexed.

3, establish keyword index

