Fixing /usr/bin/pycompile failing to find the module ConfigParser on Ubuntu


pycompile threw an exception saying it could not find the module ConfigParser. At first I assumed it simply wasn't installed, but after trying to install it with pip I found it was already there.

```
Traceback (most recent call last):
  File "/usr/bin/pycompile", line 35, in <module>
    from debpython.version import SUPPORTED, debsorted, vrepr, \
  File "/usr/share/python/debpython/version.py", line 24, in <module>
    from ConfigParser import SafeConfigParser
ImportError: No module named 'ConfigParser'
```

ConfigParser is a Python 2.x configuration-parsing module; in Python 3.x it was renamed to the lowercase ```configparser```. Since my Linux environment mainly runs Python 3.5, I concluded that this pycompile script was still written for Python 2.x.
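If your own code has to run on both Python versions, the rename is easy to paper over with a guarded import — a minimal sketch (the ```example.ini``` file name is just a placeholder):

```python
try:
    # Python 3: the module was renamed to lowercase.
    from configparser import ConfigParser
except ImportError:
    # Python 2 fallback: SafeConfigParser is the closest equivalent.
    from ConfigParser import SafeConfigParser as ConfigParser

parser = ConfigParser()
parser.read('example.ini')  # placeholder config file
print(parser.sections())
```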

```bash
whereis pycompile
# /usr/bin/pycompile
mv /usr/bin/pycompile /usr/bin/pycompile.backup
ln -s /usr/bin/py3compile /usr/bin/pycompile
```

I used ```whereis``` to locate pycompile, found the 3.x version (py3compile) in the same directory, backed up the original, and created a new symlink.



How to install Sentry on Python 3.x for real-time error monitoring


Installing Sentry on Python 3.x is impossible!!!

Sentry is an error-collection tool that surfaces errors to developers in real time, and its admin UI is very well designed.
But you will find that installing it on Python 3.x is a struggle that ultimately ends in failure.

Back in 2015 someone opened a GitHub issue asking why Sentry could not be installed on Python 3.x, and the author replied that it was not supported. If you try to install sentry with pip, you get an error like this:

```
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting sentry
Using cached https://pypi.tuna.tsinghua.edu ... ar.gz
Collecting BeautifulSoup>=3.2.1 (from sentry)
Using cached https://pypi.tuna.tsinghua.edu ... ar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-krx08m8x/BeautifulSoup/setup.py", line 22
print "Unit tests have failed!"
^
SyntaxError: Missing parentheses in call to 'print'

----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-krx08m8x/BeautifulSoup/
```

In the project's ```setup.py``` we can see that it is built on ```Django``` and declares ```Programming Language :: Python :: 2 :: Only``` — it supports 2.x only. It's 2019, and it still only runs on Python 2.x.

```
classifiers=[
    'Framework :: Django',
    'Intended Audience :: Developers',
    'Intended Audience :: System Administrators',
    'Operating System :: POSIX :: Linux',
    'Programming Language :: Python :: 2',
    'Programming Language :: Python :: 2.7',
    'Programming Language :: Python :: 2 :: Only',
    'Topic :: Software Development'
],
```
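As an aside, you can check a package's declared classifiers without downloading it by querying PyPI's JSON API — a small sketch, assuming the ```requests``` package is installed:

```python
import requests

# PyPI serves package metadata at https://pypi.org/pypi/<name>/json
info = requests.get('https://pypi.org/pypi/sentry/json', timeout=10).json()['info']
for classifier in info['classifiers']:
    if classifier.startswith('Programming Language :: Python'):
        print(classifier)
```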

### How to install Sentry on Python 2.x

There are already plenty of installation tutorials online, so I won't repeat them here. You can install it with pip, build it from source, or even run the Docker image — all quite convenient.

#### Recommended installation guide
```
https://www.cnblogs.com/scharf ... .html
```


Pitfalls installing mitmproxy on Windows: the official binaries only ship the Python standard library!


### Yesterday, installing mitmproxy with pip on Windows failed with an unexpected error
```
Microsoft Visual C++ 14.0 is required
```
Installing the ```Microsoft Visual C++ 14.0``` build tools is quite a hassle, so I figured I would just grab the official prebuilt binaries and save myself the trouble. Then my addon failed to load a third-party library (```No module named ...```), and it turned out the officially built binaries cannot use third-party modules at all — not a great feeling.

### The GitHub issue

> Addon scripts don't have access to full Python 3 standard library
> issues地址:https://github.com/mitmproxy/mitmproxy/issues/3238

### The author's explanation
>Hi,
>
>Our binaries only contain parts of Python’s stdlib to save space. If you need additional modules, you need to install mitmproxy via pip or from source: https://docs.mitmproxy.org/sta ... tion/

In other words: to keep the binary package small, it only bundles (parts of) Python's standard library; if you need extra modules, install mitmproxy via pip or from source.
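For context, this only bites you once an addon script imports a third-party package. A minimal, hypothetical addon sketch (the ```requests``` import and the local collector URL are assumptions for illustration) that runs fine under a pip-installed mitmproxy but fails under the official binaries:

```python
# log_addon.py -- load with: mitmproxy -s log_addon.py
import requests  # third-party import: this is what the official binaries cannot resolve

from mitmproxy import http


def response(flow: http.HTTPFlow) -> None:
    # Forward each response's URL and status code to a hypothetical local collector.
    requests.post(
        "http://127.0.0.1:8000/log",
        json={"url": flow.request.pretty_url, "status": flow.response.status_code},
        timeout=1,
    )
```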

### Solution

1. In the end I had to install ```Microsoft Visual C++ 14.0``` anyway, then install mitmproxy with pip.
2. Digging further afterwards, the actual build failure turned out to be ```brotlipy, bindings to the Brotli compression library.```

Open the site below, find the wheel matching your Python version and system type, then download and install it:
```bash
pip install brotlipy-0.7.0-cp37-cp37m-win_amd64.whl
```
```
https://www.lfd.uci.edu/~gohlke/pythonlibs/
```


How to deal with the CryptographyDeprecationWarning in python fabric?


```text
c:\users\administrator\.virtualenvs\spiderworker-dgts38t8\lib\site-packages\paramiko\kex_ecdh_nist.py:39: CryptographyDeprecationWarning: encode_point has been deprecated on EllipticCurvePublicNumbers and will be removed in a fut
ure version. Please use EllipticCurvePublicKey.public_bytes to obtain both compressed and uncompressed point encoding.
m.add_string(self.Q_C.public_numbers().encode_point())
c:\users\administrator\.virtualenvs\spiderworker-dgts38t8\lib\site-packages\paramiko\kex_ecdh_nist.py:96: CryptographyDeprecationWarning: Support for unsafe construction of public numbers from encoded data will be removed in a fu
ture version. Please use EllipticCurvePublicKey.from_encoded_point
self.curve, Q_S_bytes
c:\users\administrator\.virtualenvs\spiderworker-dgts38t8\lib\site-packages\paramiko\kex_ecdh_nist.py:111: CryptographyDeprecationWarning: encode_point has been deprecated on EllipticCurvePublicNumbers and will be removed in a fu
ture version. Please use EllipticCurvePublicKey.public_bytes to obtain both compressed and uncompressed point encoding.
hm.add_string(self.Q_C.public_numbers().encode_point())
```

Searching for that keyword in the source code, you will find the class CryptographyDeprecationWarning; for example, it is raised by the function ```_verify_openssl_version```:

```
"OpenSSL version 1.0.1 is no longer supported by the OpenSSL "
"project, please upgrade. A future version of cryptography will "
"drop support for it.",
```
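If you just want to silence the warning in a Fabric/Paramiko script rather than change library versions, one common approach is to filter it before importing paramiko — a sketch, assuming a cryptography release that exposes ```CryptographyDeprecationWarning``` in ```cryptography.utils```:

```python
import warnings

from cryptography.utils import CryptographyDeprecationWarning

# Suppress only this warning category; everything else still surfaces normally.
warnings.filterwarnings("ignore", category=CryptographyDeprecationWarning)

import paramiko  # noqa: E402  -- imported after the filter on purpose
```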

Decrypting Changba (唱吧) lyrics with Python


While decrypting Changba lyrics I chose Python, and used the chr function to decode bytes. But chr only accepts values in the range 0–0xff (255) — what if the value you need is negative? I remembered that PHP's chr accepts negative numbers, so I dug into the PHP source and found that it masks the value with a bitwise AND (effectively c &= 0xff). Doing the same in Python makes the error go away:

```python
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> chr(-165)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(256)
>>> chr(-165 & 0xff)
'['
>>>
```

The full Changba lyrics decryption script:

```python
# -*- coding: utf-8 -*-
import re
import os


class ChangBaDecrypt(object):
    encrypt_key = [-50, -45, 110, 105, 64, 90, 97, 119, 94, 50, 116, 71, 81, 54, -91, -68, ]

    def __init__(self):
        pass

    def decrypt(self, content):
        decrypt_content = bytearray()
        for i in range(0, len(content)):
            var = content[i] ^ self.encrypt_key[i % 16]
            decrypt_content.append(var & 0xff)
        return decrypt_content.decode('utf-8')

    def decrypt_by_file(self, filename):
        with open(filename, 'rb') as f:
            content = f.read()
        decrypt = self.decrypt(content)
        if re.match(r'\[\d+,\d+\]', decrypt):
            return decrypt


changba = ChangBaDecrypt()
decrypt = changba.decrypt_by_file(os.path.join(os.path.curdir, '../tests/data/a89f8523a6724a915c6b2038c928b342.zrce'))
print(decrypt)
```

Using Python to deal with powershell.exe hogging the CPU on Windows


### Background

The Windows process powershell.exe was using an excessive amount of CPU and regularly made the machine stutter, so I wrote a Python script to monitor the process and terminate it whenever its CPU usage goes above 40%.

### Dependencies
The library used here is ```psutil```, a cross-platform process and system utility that makes it easy to monitor CPU, memory and so on.

```bash
pip install pipenv
pipenv install psutil
```
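As a quick sanity check that psutil works in your environment, you can query the machine-wide CPU and memory figures (a minimal sketch, separate from the monitoring script below):

```python
import psutil

# Sample system-wide CPU usage over one second, plus current memory usage.
print('cpu: %s%%' % psutil.cpu_percent(interval=1))
print('memory: %s%%' % psutil.virtual_memory().percent)
```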

### Solution

The script runs in an endless loop, sleeping a few seconds between scans (5 seconds in the version below):

```python
import psutil
import time

while True:
    for proc in psutil.process_iter(attrs=['pid', 'name']):
        if proc.name() == 'powershell.exe':
            cpu_percent = proc.cpu_percent()
            print('current cpu percent: %s' % str(cpu_percent))
            if cpu_percent > 40:
                proc.terminate()
                print('powershell.exe has been terminated')

    time.sleep(5)
```

Python basics: a quick tour of dict create/read/update/delete, plus a custom immutable dict


Dictionaries are common in high-level languages: Java has HashMap, PHP has associative (key-value) arrays, and Python has dict. A dict is a mutable container that can hold arbitrary data, where every element is a key-value pair (key => value). In Python the key and value are separated by a colon, pairs are separated by commas, and the whole thing is wrapped in curly braces. A dict can of course contain other dicts, nested to any depth — it looks a lot like JSON. If you are not familiar with converting between Python dicts and JSON, see the earlier article on that topic.
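For reference, the dict/JSON round trip mentioned above is one call each way with the standard ```json``` module — a minimal sketch:

```python
import json

person = {'Name': 'Zara', 'Age': 7, 'Class': 'First'}

text = json.dumps(person)   # dict -> JSON string
data = json.loads(text)     # JSON string -> dict
print(data['Name'])         # Zara
```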

```python
Python 2.7.13 (default, Nov 24 2017, 17:33:09)
[GCC 6.3.0 20170516] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> dict = {'Name': 'Zara', 'Age': 7, 'Class': 'First'};
>>> dict['Name']
'Zara'
>>> dict['notExist']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'notExist'
>>> del dict['Age']
>>> dict
{'Name': 'Zara', 'Class': 'First'}
>>> dict.clear()
>>> dict
{}
>>> if 'Age' in dict:
...     print True
... else:
...     print False
...
False
```

So how do you iterate over a dict? There are a couple of ways. The first is a plain for loop over the dict, which yields each key; you then index into the dict with that key to get the value. The second is the dict's items() method, whose return value you can iterate over as (key, value) pairs. Keep in mind that iterating a dict directly does not give you key, value pairs:

```python
>>> for key, value in dict:
...     print('%s => %s' % (key, value))
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: too many values to unpack
>>> for key, value in dict.items():
...     print('%s => %s' % (key, value))
...
Beth => 9102
Alice => 2341
Cecil => 3258
>>> for key in dict:
...     print('%s => %s' % (key, dict[key]))
...
Beth => 9102
Alice => 2341
Cecil => 3258
```

Note that if a key is not in the dict, accessing it directly raises a KeyError, so you need to check for it first. There are two ways: has_key(key) (Python 2 only) and key in dict; the second reads more naturally and is the one to prefer. del dict[key] removes that element from the dict, and calling clear() empties the dict, leaving you with an empty {}.

Sometimes you want an immutable dictionary that can only be populated when it is constructed, with no adding, updating or deleting afterwards. For that you can build on the standard library's mapping types (for example by subclassing dict, or the mapping ABCs in collections.abc): override __setitem__ and __delitem__ to raise an exception, and you have an immutable dictionary.
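A minimal sketch of that idea, subclassing dict directly (note that on a plain dict subclass, methods such as update() and clear() bypass these hooks, which is why the abstract base classes are the more thorough route):

```python
class ImmutableDict(dict):
    """A dict that can only be populated at construction time."""

    def __setitem__(self, key, value):
        raise TypeError('this dictionary is read-only')

    def __delitem__(self, key):
        raise TypeError('this dictionary is read-only')


info = ImmutableDict({'Name': 'Zara', 'Age': 7})
print(info['Name'])   # Zara
info['Age'] = 8       # raises TypeError: this dictionary is read-only
```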

Still using xrange? That's pretty dated — let me introduce itertools


itertools is Python's built-in module of efficient, convenient iterator tools. Iterators are lazily evaluated: a value is only computed when it is actually consumed, so the data never has to be materialised in memory up front. That makes them especially attractive when walking through huge files or unbounded sequences.

itertools provides three kinds of iterator functions:

1. Infinite iterators: generate an endless sequence — arithmetic progressions, geometric progressions, the natural numbers, and so on.
2. Finite iterators: take one or more sequences, then combine, filter or group them.
3. Combinatoric generators: permutations and combinations of a sequence, Cartesian products, etc.

There are three infinite iterators: count(firstval=0, step=1), cycle(iterable) and repeat(object[, times]).
```bash
>>> import itertools
>>>
>>> dir(itertools)
['__doc__', '__file__', '__name__', '__package__', 'chain', 'combinations', 'combinations_with_replacement', 'compress', 'count', 'cycle', 'dropwhile', 'groupby', 'ifilter', 'ifilterfalse', 'imap', 'islice', 'izip', 'izip_longest', 'permutations', 'product', 'repeat', 'starmap', 'takewhile', 'tee']
```

How count works: firstval is the starting value and step (default 1) is the increment. With a start of 5 and a step of 2, the generated infinite sequence is 5, 7, 9, 11, ... — an endless arithmetic progression starting at 5 with a common difference of 2. You don't need to worry about memory for all that data, because values are only computed as they are consumed; if you only want the values from 5 to 20, just test with an if and break out of the loop.
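A small sketch of exactly that pattern with count:

```python
from __future__ import print_function
import itertools

# Infinite arithmetic sequence: start at 5, step 2; stop consuming once we pass 20.
for n in itertools.count(5, 2):
    if n > 20:
        break
    print(n, end=' ')   # 5 7 9 11 13 15 17 19
```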

cycle, by contrast, loops over a sequence of values endlessly:

```python
from __future__ import print_function
import itertools

cycle_strings = itertools.cycle('ABC')
i = 1
for string in cycle_strings:
    if i == 10:
        break
    print('%d => %s' % (i, string), end=' ')
    i += 1
```

### repeat: produce the same object over and over
```python
from __future__ import print_function
import itertools

for item in itertools.repeat('hello world', 3):
    print(item)
```

### itertools chain
```bash
>>> from __future__ import print_function
>>> import itertools
>>> for item in itertools.chain([1, 2, 3], ['a', 'b', 'c']):
...     print(item, end=' ')
...
1 2 3 a b c
```


### Q&A:
Why do we need from __future__ import print_function?
- Because we want to print the items without a newline after each one, we pull in this hack on the first line of the file (on Python 2) and pass the end parameter set to a space, giving space-separated output on a single line.


Computing IP and UV statistics from logs with Spark/PySpark


## Counting IPs and UVs from logs with Spark/PySpark

### What is Spark

Apache Spark is a fast, general-purpose compute engine designed for large-scale data processing. It has the strengths of Hadoop MapReduce, but intermediate job results can be kept in memory instead of being written back to and read from HDFS, so Spark is a much better fit for algorithms that iterate over MapReduce, such as data mining and machine learning.

It can be used to build large-scale, low-latency data analysis applications.

### Highlights

- The high-level API hides the details of the cluster itself, so Spark application developers can focus on the computation their application needs to do
- Spark is fast and supports both interactive computation and complex algorithms
- Spark is a general-purpose compute engine that can handle all kinds of workloads, including SQL queries, text processing and machine learning

### Counting UV and IP with PySpark

#### Log format
```json
[2018-10-09 00:01:08] local.INFO: {"api":"/api/products/295","status":200,"method":"GET","userId":34,"ip":"14.111.56.47","authorization":"xxxxxxx","app-version":"1.7.21","app-client":"webapp","app-brand":null,"app-model":null,"app-language":null,"request":[],"response":[]}
```

- Load the text file and ```flatMap``` over it: strip the timestamp, keep only the valid JSON, parse it into a Python dict with ```json.loads()```, drop the ```GET``` query string from the request path, and return a list of api, userId and ip
- ```map``` each item to a (value, 1) pair and use ```reduceByKey``` to count how often each one occurs
- ```collect``` the results


```python
from pyspark.context import SparkContext
import re
import json

pattern = re.compile(r'\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\][^{]+(.+)$')
replace_interrogation = re.compile(r'\?[^?]+$')

logFile = "/home/xxx/logs/api-2018-10-09.log"
spark = SparkContext('local[4]', 'test')

file = spark.textFile(logFile)


def flat_request(s):
    words = pattern.match(s)
    if words is None:
        # flatMap expects an iterable, so skip non-matching lines with an empty list
        return []
    date, request = words.groups()
    request = json.loads(request)
    request['api'] = replace_interrogation.sub('', request['api'])
    return [request['api'], request['userId'], request['ip']]


count = file.flatMap(flat_request)\
    .map(lambda word: (word, 1))\
    .reduceByKey(lambda a, b: a + b)\
    .collect()

print(count)
```

#### Output
```python
[('/api/boot', 11), (0, 168), ('14.111.56.47', 146), ('/api/products/295', 4), (34, 35), ('183.227.132.159', 20), ('/api/shopping/cart', 11), ('/api/products/149', 1), ('122.110.134.17', 6), ('61.148.243.63', 2), ('/api/address/set/default', 1), ('/api/wx/share/webapp', 1), ('/api/wx/h5/login', 139), ('/api/wx/h5/callback', 2), ('/api/found/circle', 1), (39, 1), ('/api/user', 1), ('/api/product/recommend', 7), ('/', 3), ('140.205.205.25', 1), ('/ueditor/server', 21), ('209.97.147.150', 1), ('117.188.2.120', 15), ('58.17.200.100', 12), ('/api/order/confirm', 1), ('183.136.190.62', 1)]
```

### Environment setup

[https://blog.csdn.net/lzufeng/ ... 96083](https://blog.csdn.net/lzufeng/ ... 096083)


How to run the script

[how to run python script](http://spark.apache.org/docs/l ... m-here)
```bash
/mnt/spark-2.2.0/bin/spark-submit simple.py
```


### Example: counting errors from logs in real time

[https://blog.csdn.net/weixin_3 ... 74669](https://blog.csdn.net/weixin_3 ... 174669)


Run ```./bin/pyspark``` to get an interactive shell.

### A quick note on lambda

[https://www.cnblogs.com/evenin ... .html](https://www.cnblogs.com/evenin ... 4.html)
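As a quick illustration (separate from the linked article), the lambdas used in the job above are just anonymous functions; writing them out as named functions makes that clear:

```python
# The two lambdas from the job, written as named functions for comparison.
def to_pair(word):
    return (word, 1)           # same as: lambda word: (word, 1)


def add(a, b):
    return a + b               # same as: lambda a, b: a + b  (used by reduceByKey)


print(to_pair('/api/boot'))    # ('/api/boot', 1)
print(add(2, 3))               # 5
```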


### Glossary

- RDD (Resilient Distributed Datasets): resilient distributed data sets
- partition: a partition (shard) of an RDD's data

Using Scrapy properly: getting started with installation (part 1)


### Scrapy's official introduction

```
An open source and collaborative framework for extracting the data you need from websites.
In a fast, simple, yet extensible way.
```

1. It is an open-source, collaborative framework for extracting the data you need from websites — fast, simple and extensible.
2. As everyone knows, it is a very powerful crawling framework that helps us pull the data we need out of large numbers of web pages, quickly and with easy extensibility.
3. It is written in ```python```, so once our code is done we can run it on ```Linux```, ```Windows```, ```Mac``` and ```BSD```.

#### Official numbers

- 24k stars, 6k forks and 1.6k watchers on GitHub
- 4.0k followers on Twitter
- 8.7k questions on StackOverflow

#### Requirements

- Python 2.7 or Python 3.4+
- Works on Linux, Windows, Mac OSX, BSD

#### Quick install

If you are not sure how to install ```pip```, see this article: [How to quickly build and install pip, Python's package manager](http://www.sourcedev.cc/article/131)

```
pip install scrapy
```

### Creating a project

```bash
root@ubuntu:/# scrapy startproject -h
Usage
=====
scrapy startproject <project_name> [project_dir]

Create new project

Options
=======
--help, -h show this help message and exit

Global Options
--------------
--logfile=FILE log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
log level (default: DEBUG)
--nolog disable logging completely
--profile=FILE write python cProfile stats to FILE
--pidfile=FILE write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
set/override setting (may be repeated)
--pdb enable pdb on failure
root@ubuntu:/# scrapy startproject helloDemo
New Scrapy project 'helloDemo', using template directory '/usr/local/lib/python3.5/dist-packages/scrapy/templates/project', created in:
/helloDemo

You can start your first spider with:
cd helloDemo
scrapy genspider example example.com
root@ubuntu:/# cd helloDemo/
root@ubuntu:/helloDemo# ls
helloDemo scrapy.cfg

root@ubuntu:/helloDemo# scrapy crawl spider baidu
2018-08-24 17:33:00 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: helloDemo)
2018-08-24 17:33:00 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_MODULES': ['helloDemo.spiders'], 'NEWSPIDER_MODULE': 'helloDemo.spiders', 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'helloDemo'}
Usage
=====
scrapy crawl [options] <spider>

crawl: error: running 'scrapy crawl' with more than one spider is no longer supported
root@ubuntu:/helloDemo# scrapy crawl baidu
2018-08-24 17:33:06 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: helloDemo)
2018-08-24 17:33:06 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'helloDemo', 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'helloDemo.spiders', 'SPIDER_MODULES': ['helloDemo.spiders']}
2018-08-24 17:33:07 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2018-08-24 17:33:07 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-08-24 17:33:07 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-08-24 17:33:07 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-08-24 17:33:07 [scrapy.core.engine] INFO: Spider opened
2018-08-24 17:33:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-08-24 17:33:07 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-08-24 17:33:07 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.baidu.com/robots.txt> from <GET http://wwww.baidu.com/robots.txt>
2018-08-24 17:33:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.baidu.com/robots.txt> (referer: None)
2018-08-24 17:33:07 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET http://wwww.baidu.com/>
2018-08-24 17:33:08 [scrapy.core.engine] INFO: Closing spider (finished)
2018-08-24 17:33:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 1,
'downloader/request_bytes': 443,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 1125,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 8, 24, 9, 33, 8, 11376),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'memusage/max': 52117504,
'memusage/startup': 52117504,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2018, 8, 24, 9, 33, 7, 430751)}
2018-08-24 17:33:08 [scrapy.core.engine] INFO: Spider closed (finished)
```

#### Project layout
```bash
root@ubuntu:/helloDemo# tree
.
├── helloDemo
│   ├── __init__.py
│   ├── items.py          # items: the data structures
│   ├── middlewares.py    # spider middlewares
│   ├── pipelines.py      # pipelines: where data gets stored
│   ├── __pycache__
│   │   ├── __init__.cpython-35.pyc
│   │   └── settings.cpython-35.pyc
│   ├── settings.py       # global settings
│   └── spiders           # the spiders themselves
│       ├── baidu.py      # the baidu spider created above
│       ├── __init__.py
│       └── __pycache__
│           └── __init__.cpython-35.pyc
└── scrapy.cfg
```

spiders/baidu.py is where we process the scraped data; response contains the entire HTML document returned for the crawled page.
```python
# -*- coding: utf-8 -*-
import scrapy


class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['wwww.baidu.com']
    start_urls = ['http://wwww.baidu.com/']

    def parse(self, response):
        pass
```
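To give a feel for what goes into parse(), here is a minimal sketch that extracts the page title and follows links; the spider name, start URL and selectors are illustrative assumptions, not part of the generated template:

```python
# -*- coding: utf-8 -*-
import scrapy


class TitleSpider(scrapy.Spider):
    name = 'title_example'                          # hypothetical example spider
    start_urls = ['http://quotes.toscrape.com/']    # the site used in Scrapy's own tutorial

    def parse(self, response):
        # Pull the <title> text out of the downloaded page.
        yield {'title': response.xpath('//title/text()').extract_first()}

        # Follow every link on the page and parse those pages the same way.
        for href in response.css('a::attr(href)').extract():
            yield response.follow(href, callback=self.parse)
```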

### I'll cover more of Scrapy's usage in later articles


### References

1. [Scrapy official site](https://scrapy.org)
2. [Scrapy official documentation](https://docs.scrapy.org/en/latest/)
3. [Source code on GitHub](https://github.com/scrapy/scrapy)
4. [Installing pip](http://www.sourcedev.cc/article/131)
