
python spider

Posted on 2022-07-22, edited on 2022-08-05, in system

python spider: application and practice

web spider

Search Engine

Crawl the web page

The Robots protocol (also called the crawler protocol or robot protocol), formally the Robots Exclusion Protocol, is how a website tells search engines which pages may be crawled and which may not.
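As a minimal sketch of how a crawler might honor these rules, the example below uses `urllib.robotparser` from the Python standard library. The robots.txt content, the `ExampleBot` user agent, the `example.com` URLs, and the `build_parser` helper are all assumptions made for illustration; a real crawler would download the file from the target site (conventionally at `/robots.txt`) instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser (illustrative helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" and the example.com URLs are placeholders, not real targets.
allowed = parser.can_fetch("ExampleBot", "https://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "https://example.com/private/data.html")
print(allowed, blocked)
```

A polite crawler would call `can_fetch` before every request and skip any URL for which it returns `False`.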

Data storage

The search engine stores the pages fetched by its crawler in a raw page database.
The page data is generally HTML.
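The storage step could look roughly like the sketch below, which keeps raw HTML in a SQLite table through the standard `sqlite3` module. The table name `raw_pages`, its columns, and the helper functions are assumptions for illustration only; production search engines use far larger distributed stores.

```python
import sqlite3
from typing import Optional


def open_page_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a small raw-page database. Schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_pages ("
        "url TEXT PRIMARY KEY, html TEXT NOT NULL, fetched_at TEXT NOT NULL)"
    )
    return conn


def save_page(conn: sqlite3.Connection, url: str, html: str, fetched_at: str) -> None:
    """Insert a page, replacing any earlier copy stored under the same URL."""
    conn.execute(
        "INSERT OR REPLACE INTO raw_pages (url, html, fetched_at) VALUES (?, ?, ?)",
        (url, html, fetched_at),
    )
    conn.commit()


def load_page(conn: sqlite3.Connection, url: str) -> Optional[str]:
    """Return the stored HTML for a URL, or None if it was never saved."""
    row = conn.execute("SELECT html FROM raw_pages WHERE url = ?", (url,)).fetchone()
    return row[0] if row else None


conn = open_page_store()
save_page(conn, "https://example.com/", "<html><body>Hello</body></html>", "2022-07-22T00:00:00")
print(load_page(conn, "https://example.com/"))
```

Keying the table on the URL means a re-crawl overwrites the previous copy, which is one simple design choice among several; keeping every version would need a different key.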

Pre-processing

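The search engine runs the pages fetched by the crawler through several pre-processing steps:
- Text extraction
- Chinese word segmentation
- Noise removal (for example copyright notices, navigation bars, and advertisements)
- Index processing
- Link relationship calculation
- Special file handling
Besides HTML, search engines can usually also crawl and index many text-based file types, such as PDF, Word, WPS, XLS, PPT, and TXT files.
Traditionally, search engines cannot process non-text content such as images, video, and Flash, and they cannot execute scripts or programs.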
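Two of the pre-processing steps, text extraction with noise removal and collecting links for later link-relationship calculation, can be sketched with the standard `html.parser` module. The class name `TextAndLinkExtractor`, the set of tags treated as noise, and the sample HTML are assumptions for illustration; real engines use much more elaborate heuristics, and word segmentation and indexing are not shown here.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class TextAndLinkExtractor(HTMLParser):
    """Collect visible text and link targets, skipping tags treated as noise."""

    # Assumed noise tags: scripts, styles, navigation bars, and footers.
    SKIPPED_TAGS = {"script", "style", "nav", "footer"}

    def __init__(self) -> None:
        super().__init__()
        self.text_parts: List[str] = []
        self.links: List[str] = []
        self._skip_depth = 0  # greater than 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "a" and self._skip_depth == 0:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            stripped = data.strip()
            if stripped:
                self.text_parts.append(stripped)


def preprocess(html: str) -> Tuple[str, List[str]]:
    """Return (plain text, outgoing links) for one HTML page."""
    extractor = TextAndLinkExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.text_parts), extractor.links


# Hypothetical sample page.
SAMPLE_HTML = """
<html><head><style>body { color: red; }</style></head>
<body>
<nav><a href="/home">Home</a></nav>
<p>Crawlers fetch pages.</p>
<a href="/next">Next page</a>
<footer>Copyright notice</footer>
<script>console.log("ignored");</script>
</body></html>
"""

text, links = preprocess(SAMPLE_HTML)
print(text)
print(links)
```

The extracted text would then feed word segmentation and indexing, while the collected links give the edges needed for link-relationship calculation.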

Provide search services and website ranking


Post author: Gan Gecen
Post link: https://ganliber.github.io/2022/07/22/system/python-spider-1/
Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.