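The data my PHP script fetches is sent with chunked transfer encoding and is gzip-compressed.
As far as I understand it, chunked encoding works like this: the data is sent in blocks, and each block has a header and a body. The header gives the length of the body in hexadecimal, the header and body are separated by CRLF, and the final block is a single line containing 0, which marks the end of the chunks.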
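The format described above can be sketched as a small decoder that works on a complete chunked body held in a string. This is a minimal sketch, not the code from the post: `decode_chunked()` is a hypothetical helper name, and it assumes the whole body has already been read and that any trailer headers after the last chunk can be ignored.

```php
<?php
/**
 * Decode a complete chunked body held in a string.
 * decode_chunked() is a hypothetical helper, not a PHP built-in.
 */
function decode_chunked(string $raw): string
{
    $body = '';
    $pos  = 0;
    while (true) {
        // Each chunk starts with a size line terminated by CRLF.
        $eol = strpos($raw, "\r\n", $pos);
        if ($eol === false) {
            throw new RuntimeException('Missing CRLF after chunk size');
        }
        // The size line may carry ";extension" after the hex digits.
        $sizeLine = substr($raw, $pos, $eol - $pos);
        $hex = trim(explode(';', $sizeLine, 2)[0]);
        if ($hex === '' || !ctype_xdigit($hex)) {
            throw new RuntimeException('Invalid chunk size line');
        }
        $size = (int) hexdec($hex);
        if ($size === 0) {
            break; // last chunk; trailer headers (if any) are ignored here
        }
        $start = $eol + 2;
        // The chunk data must be followed by its own CRLF.
        if ($start + $size + 2 > strlen($raw)) {
            throw new RuntimeException('Chunk is truncated');
        }
        $body .= substr($raw, $start, $size);
        $pos = $start + $size + 2; // skip the CRLF that ends the chunk data
    }
    return $body;
}

// Two chunks of 4 and 5 bytes, then the terminating zero-size chunk.
echo decode_chunked("4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"), "\n"; // Wikipedia
```

The decoder slices by the announced byte count rather than searching for CRLF inside the data, so binary payloads that happen to contain CRLF bytes are left intact.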
Response headers:
Array
(
    [0] => HTTP/1.1 200 OK
    [1] => Server: Dict/34002
    [2] => Date: Wed, 17 Dec 2014 06:49:22 GMT
    [3] => Content-Type: text/html; charset=utf-8
    [4] => Transfer-Encoding: chunked
    [5] => Connection: keep-alive
    [6] => Keep-Alive: timeout=60
    [7] => Cache-Control: private
    [8] => Last-Modified: Wed, 17 Dec 2014 04:57:49 GMT
    [9] => Expires: Wed, 17 Dec 2014 06:49:22 GMT
    [10] => Set-Cookie: uvid=VJEncoTSVYJC; expires=Thu, 31-Dec-37 23:55:55 GMT; domain=.dict.cn; path=/
    [11] => Content-Encoding: gzip
)
My code:
    if($this->response_num==200)
    {
        if($this->is_chunked)
        {
            // Read the chunk header to get the length of the chunk body
            $chunk_size = (int)hexdec(fgets($this->conn));
            while(!feof($this->conn) && $chunk_size > 0)
            {
                // Read the number of bytes given in the chunk header
                $this->response_body .= fread( $this->conn, $chunk_size );
                fseek($this->conn, 2, SEEK_CUR);
                $chunk_size = (int)hexdec(fgets( $this->conn,4096));
            }
        }
        else
        {
            $len=0;
            // Read the body of the response
            while($items = fread($this->conn, $this->response_body_length))
            {
                $len = $len+strlen($items);
                $this->response_body = $items;
                // Break out once the whole body has been read; otherwise the read seems to block!
                if($len >= $this->response_body_length)
                {
                    break;
                }
            }
        }
        if($this->is_gzip)
        {
            $this->response_body = gzinflate(substr($this->response_body,10));
        }
        $this->getTrans($this->response_body);
    }
This warning appears almost every time:
Warning: gzinflate(): data error in E:\CodeEdit\php\http\dict.php on line 384
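Occasionally the response is parsed correctly, so the chunked decoding is probably at fault. I have read up on it and tried several different ways of decoding, but I still cannot get it to work reliably.
Replies and discussion (solutions)
You can use gzdecode to decode it.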
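A minimal sketch of that suggestion, assuming PHP 5.4 or later with the zlib extension loaded. `gzinflate(substr($body, 10))` assumes the gzip header is exactly 10 bytes, which only holds when no optional header fields are present, and it leaves the 8-byte trailer (CRC32 and size) attached to the data. `gzdecode()` parses the header and trailer itself. The helper name `gunzip_body()` is hypothetical, and the sample data is generated locally with `gzencode()` rather than taken from the post.

```php
<?php
/**
 * Decompress a gzip-encoded HTTP body.
 * gunzip_body() is a hypothetical helper name for this sketch.
 */
function gunzip_body(string $data): string
{
    // Every gzip member starts with the magic bytes 0x1f 0x8b.
    if (substr($data, 0, 2) !== "\x1f\x8b") {
        throw new RuntimeException('Body does not look like gzip data');
    }
    // gzdecode() handles the header, optional fields and trailer itself.
    $decoded = @gzdecode($data);
    if ($decoded === false) {
        throw new RuntimeException('gzip data is corrupt or incomplete');
    }
    return $decoded;
}

// Build a gzip stream locally so the example needs no network access.
$original   = 'hello, chunked world';
$compressed = gzencode($original);

var_dump(gunzip_body($compressed) === $original); // bool(true)
```

If the chunk framing has not been removed correctly first, the input will not be a valid gzip stream and decompression fails regardless of which function is used.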
Strangely, it sometimes does return a result, for example:
int.(打招呼)喂;你好
Other times it reports an error, for example:
Warning: gzinflate(): data error in E:\CodeEdit\php\http\dict.php on line 380
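I suspect the error is still in the chunked decoding. The key point is that the returned data is gzip-compressed first and then sent with chunked transfer encoding, so decoding has to go in the reverse order: remove the chunk framing first, then decompress.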
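The reverse order described above can be sketched for a stream as follows. This is a sketch under assumptions, not a confirmed diagnosis: a likely cause of the intermittent failure is that `fread()` on a socket may return fewer bytes than requested, and that `fseek()` cannot skip bytes on a network stream, so the chunk framing drifts. The helper names `read_exact()` and `read_chunked_body()` are hypothetical, and a `php://memory` stream stands in for the socket so the example runs without a network.

```php
<?php
/** Read exactly $length bytes; a single fread() on a socket may return fewer. */
function read_exact($stream, int $length): string
{
    $data = '';
    while (strlen($data) < $length) {
        $part = fread($stream, $length - strlen($data));
        if ($part === false || $part === '') {
            throw new RuntimeException('Stream ended before the chunk was complete');
        }
        $data .= $part;
    }
    return $data;
}

/** Step 1: strip the chunk framing from the stream. */
function read_chunked_body($stream): string
{
    $body = '';
    while (($line = fgets($stream)) !== false) {
        // Drop any ";extension" and the trailing CRLF before parsing the hex size.
        $size = (int) hexdec(trim(explode(';', $line, 2)[0]));
        if ($size === 0) {
            break; // zero-size chunk ends the body
        }
        $body .= read_exact($stream, $size);
        read_exact($stream, 2); // consume the CRLF after the data by reading it
    }
    return $body;
}

// Simulate a response body: gzip first, then split into two chunks.
$gz   = gzencode('int. hello');
$half = intdiv(strlen($gz), 2);
$a    = substr($gz, 0, $half);
$b    = substr($gz, $half);

$stream = fopen('php://memory', 'r+');
fwrite($stream, dechex(strlen($a)) . "\r\n" . $a . "\r\n");
fwrite($stream, dechex(strlen($b)) . "\r\n" . $b . "\r\n");
fwrite($stream, "0\r\n\r\n");
rewind($stream);

// Step 2: decompress only after the chunk framing is gone.
echo gzdecode(read_chunked_body($stream)), "\n"; // int. hello
```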
    if($this->is_chunked)
    {
        /*
        // Read the chunk header to get the length of the chunk body
        $chunk_size = (int)hexdec(trim(fgets($this->conn)));
        while(!feof($this->conn) && $chunk_size > 0)
        {
            // Read the number of bytes given in the chunk header
            $this->response_body .= fread( $this->conn, $chunk_size );
            fseek($this->conn, 2, SEEK_CUR);
            $next_line = trim(fgets($this->conn));
            if($next_line === '0')
            {
                echo $next_line;exit();
            }
            else
            {
                $chunk_size = (int)hexdec($next_line);
            }
        }
        */
        // Read the whole response until the server closes the connection
        while(!feof($this->conn))
        {
            $this->response_body .= fread($this->conn, 1024);
        }
        if(preg_match_all("#\r\n#i", $this->response_body, $match))
        {
            // Split on CRLF: first item is the size line, last item is the final 0
            $result=preg_split("#\r\n#i", $this->response_body, -1, PREG_SPLIT_NO_EMPTY );
            // echo "";
            // print_r($result);
            /*
            foreach($result as $v)
            {
                echo $v."\n";
            }
            echo "\n";
            */
            /*
            echo hexdec($result[0])."\n";
            echo mb_strlen($result[1])+mb_strlen($result[2])."\n";
            */
            $len = count($result);
            $this->response_body='';
            // Join everything between the size line and the final 0
            for($i=1; $i<$len-1; $i++)
            {
                $this->response_body .= $result[$i];
            }
            //echo strlen($this->response_body); exit();
        }
        else
        {
            die("Failed to match the terminator");
        }
    }
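The basic idea: first change the request header to Connection: close, so that while(!feof($this->conn)) can read all of the data in one go.
Then, because the header and body of a chunk are separated by CRLF, I split the data directly with a regular expression and get an array containing the data length and the data. The first item is the total length of all the data (rather than the length of each individual chunk, which seems to differ a little from how chunked encoding is described, so it is no wonder that decoding strictly by the chunk format failed), and the last item is 0, which marks the end. After repeated testing, it works.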
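One caveat, shown as a small sketch rather than a claim about this particular server: the first size line would equal the total length if the whole body arrived as a single chunk, which is consistent with the chunked format. Splitting on CRLF also removes any CRLF bytes that occur inside the data itself, and compressed data can contain them. The payload below is made up for illustration.

```php
<?php
// A 6-byte body that itself contains CRLF, sent as a single chunk.
$payload = "ab\r\ncd";
$raw = dechex(strlen($payload)) . "\r\n" . $payload . "\r\n0\r\n\r\n";

// Approach from the post: split on CRLF, drop the first and last items.
$parts   = preg_split("#\r\n#", $raw, -1, PREG_SPLIT_NO_EMPTY);
$bySplit = implode('', array_slice($parts, 1, -1));

// Size-based approach: take exactly as many bytes as the size line announces.
$eol    = strpos($raw, "\r\n");
$size   = (int) hexdec(substr($raw, 0, $eol));
$bySize = substr($raw, $eol + 2, $size);

var_dump(strlen($bySplit));     // int(4), the embedded CRLF was lost
var_dump($bySize === $payload); // bool(true)
```

With more than one chunk, the split approach would also leave the intermediate size lines inside the joined body, so slicing by the announced size is the safer choice.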