Writing a Web Crawler in PHP

2016-04-23 · 🙈Ray

Last week, I needed to get some useful data from other companies' websites for work, so I spent some time researching web crawlers. Here I introduce several situations you may run into when you crawl data from the web.

1. General situation: An ordinary page that just displays information, like the home page of Google or Baidu. Everything you need is present in the HTML and CSS.

This is the easiest situation we can meet. We can reach our goal with just the following code.

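<?php
$url = "http://www.baidu.com";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, false);         // leave the response headers out of the result
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body as a string instead of printing it
curl_setopt($ch, CURLOPT_ENCODING, "");          // accept every compression format cURL supports
$result = curl_exec($ch);
if ($result === false) {
    echo "cURL error: " . curl_error($ch);
} else {
    echo $result;
}
curl_close($ch);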

Run the code on your own server, and you will see this:

2. Special situation 1: The site requires a login, such as a BBS or a blog.

In this situation, we need to make the server think we are logged in. So when we send a request to the server, we also need to send the necessary information with it, such as parameters and cookies. For this I recommend an excellent tool called Fiddler. With it, you can capture nearly everything you need. It runs on the .NET Framework. Download from: Fiddler download address.


Now, let us see what we get if we use the same approach as before.

<?php
$url = "http://10.200.21.61:7001/ieas2.1/cjcx/queryQmcj";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_ENCODING, "");
$result = curl_exec($ch);  // no cookie is sent, so the server does not know who we are
echo $result;

curl_close($ch);

This is what I got.

Next, I'll show how to crawl such websites, using the educational administration website of BUAA as an example. Here, I try to get the grade data.

First, open Fiddler and find the data we need.


This time, we need to rewrite the code and add the necessary information to it.

<?php
$url = "http://10.200.21.61:7001/ieas2.1/cjcx/queryQmcj";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_ENCODING, "");
// The session cookie captured with Fiddler. We just need to send it with the request.
$cookie = "JSESSIONID=fy9jXhLLXYKQRpnY56mqx1mSznwp*********************";
curl_setopt($ch, CURLOPT_COOKIE, $cookie);
$result = curl_exec($ch);
echo $result;

curl_close($ch);

This time, I got this.
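If the captured request carries more than one cookie, they all go into the same CURLOPT_COOKIE string, separated by a semicolon and a space. The following is only a minimal sketch of that idea: the helper names buildCookieHeader and fetchWithCookies are made up for illustration, and the cookie names in the usage comment are placeholders.

```php
<?php
// Hypothetical helper: join captured cookie name/value pairs into the
// "name=value; name2=value2" format that CURLOPT_COOKIE expects.
function buildCookieHeader(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        // Values are sent exactly as captured; no extra encoding is applied.
        $pairs[] = $name . "=" . $value;
    }
    return implode("; ", $pairs);
}

// Hypothetical helper: fetch a page while replaying the captured cookies.
function fetchWithCookies(string $url, array $cookies): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_HEADER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_ENCODING, "");
    curl_setopt($ch, CURLOPT_COOKIE, buildCookieHeader($cookies));
    $result = curl_exec($ch);
    if ($result === false) {
        throw new RuntimeException("cURL error: " . curl_error($ch));
    }
    // The handle is released when $ch goes out of scope.
    return $result;
}

// Example usage (placeholder values, not real ones):
// echo fetchWithCookies($url, ["JSESSIONID" => "...", "lang" => "en"]);
```

Keep in mind that a session cookie copied this way stops working once the session expires on the server, so it has to be captured again from time to time.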

3. Special situation 2: The page comes back without the data. The page and the data are loaded asynchronously.

In this situation, if you crawl the page using the URL you see in the HTML code, you will probably get only the HTML of the page, without the data. So we need to find the real URL and its parameters using Fiddler. Click through the requests listed on the left and find the one that returns the data. Then use that URL and its parameters in code similar to the code above.
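As a rough sketch of that last step, the code below sends the parameters of the captured request either in the query string or as a POST form, together with the session cookie. The function names, the example.com endpoint, and the parameter names are all assumptions for illustration; take the real ones from the captured request.

```php
<?php
// Hypothetical helper: append query parameters to the data endpoint.
function buildRequestUrl(string $endpoint, array $params): string
{
    if ($params === []) {
        return $endpoint;
    }
    // Pass "&" explicitly so the result does not depend on php.ini settings.
    $query = http_build_query($params, "", "&");
    $separator = (strpos($endpoint, "?") === false) ? "?" : "&";
    return $endpoint . $separator . $query;
}

// Hypothetical helper: request the data endpoint, replaying the session cookie.
// Pass $postFields when the captured request is a POST form submission.
function fetchData(string $url, string $cookie, array $postFields = []): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_ENCODING, "");
    curl_setopt($ch, CURLOPT_COOKIE, $cookie);
    if ($postFields !== []) {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields, "", "&"));
    }
    $result = curl_exec($ch);
    if ($result === false) {
        throw new RuntimeException("cURL error: " . curl_error($ch));
    }
    return $result;
}

// Example usage (placeholder endpoint and parameters):
// $url = buildRequestUrl("http://example.com/grades/data", ["term" => "2016", "page" => 1]);
// echo fetchData($url, "JSESSIONID=...");
```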

The response may be HTML that contains the data, HTML with the data embedded in a script, or just JSON or XML data. Whatever type we get, we only need to parse it to extract the data we need.
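To make the parsing step concrete, here is a small sketch for two of those cases: a JSON body decoded with json_decode, and an HTML table read with DOMDocument and DOMXPath. The function names and the sample field names are invented for illustration, and the HTML part assumes the data sits in an ordinary table.

```php
<?php
// Decode a JSON response body into a PHP array; throws if it is not valid JSON
// or does not decode to an object or array.
function parseJsonRows(string $body): array
{
    $data = json_decode($body, true);
    if (!is_array($data)) {
        throw new InvalidArgumentException("Response is not a JSON object or array");
    }
    return $data;
}

// Extract the text of every table cell from an HTML response, row by row.
function parseHtmlTable(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query("//tr") as $tr) {
        $cells = [];
        foreach ($xpath->query("./td|./th", $tr) as $cell) {
            $cells[] = trim($cell->textContent);
        }
        if ($cells !== []) {
            $rows[] = $cells;
        }
    }
    return $rows;
}

// Example usage with a made-up response:
// $grades = parseJsonRows('{"grades":[{"course":"Math","score":90}]}');
// echo $grades["grades"][0]["score"];
```

Note that loadHTML may misread non-ASCII text if the page does not declare its character set, so a page in Chinese may need its encoding handled before parsing.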

