How to Scrape Xueqiu Web Pages with Python

2024-05-17 23:22

1. How to scrape Xueqiu web pages with Python

The simplest option is urllib. Its usage differs between Python 2.x and Python 3.x; taking Python 2.x as an example:

import urllib
html = urllib.urlopen(url)   # in Python 2.x the function is urlopen, not open
text = html.read()

For anything more involved, use the requests library, which supports all request types as well as cookies, headers, and so on.
For still more complex cases, use selenium, which can fetch text generated by JavaScript.
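
A minimal sketch of the requests approach (assuming requests is installed; the URL and header value are placeholders, not from the original answer):

import requests

headers = {'User-Agent': 'Mozilla/5.0'}                    # pretend to be a regular browser
resp = requests.get('https://xueqiu.com/', headers=headers, timeout=10)
resp.encoding = resp.apparent_encoding                     # let requests guess the page encoding
print(resp.status_code)
print(resp.text[:200])                                     # first 200 characters of the page
print(resp.cookies.get_dict())                             # cookies the server set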

2. How do I scrape a web page with Python and perform some submit operations?

The program below is an example of fetching a page. The MyOpener class imitates a browser client and picks its User-Agent at random, so the site is less likely to decide you are a bot.
The MyFunc function fetches the URL you specify and extracts the href links from it. Fetching images works much the same way, and the other features should not be hard either; a quick search online will turn up examples.

import re
from urllib import FancyURLopener
from random import choice

user_agents = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
    'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9'
]

class MyOpener(FancyURLopener, object):
    version = choice(user_agents)

def MyFunc(url):
    myopener = MyOpener()
    s = myopener.open(url).read()
    ss = s.replace("\n", " ")
    # find href links (typical pattern; the exact regex in the original answer was lost)
    urls = re.findall(r'href="([^"\']+)"', ss, re.I)
    for i in urls:
        print(i)  # do something with each link here
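
FancyURLopener is Python 2 only (and deprecated); a rough Python 3 sketch of the same random User-Agent idea, reusing the user_agents list above, might look like this:

import re
from random import choice
from urllib.request import Request, urlopen

def my_func(url):
    req = Request(url, headers={'User-Agent': choice(user_agents)})  # random UA, as MyOpener does
    s = urlopen(req).read().decode('utf-8', errors='ignore')
    ss = s.replace("\n", " ")
    urls = re.findall(r'href="([^"\']+)"', ss, re.I)
    for i in urls:
        print(i)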

3. How to scrape specific content from a web page with Python

The simplest option is urllib. Its usage differs between Python 2.x and Python 3.x; taking Python 2.x as an example:

import urllib
html = urllib.urlopen(url)
text = html.read()

For anything more involved, use the requests library, which supports all request types as well as cookies, headers, and so on.
For still more complex cases, use selenium, which can fetch text generated by JavaScript.
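
As a minimal sketch of pulling one specific piece of content out of a page (Python 3, a regex over the raw HTML; the URL and encoding are placeholders):

import re
import urllib.request

html = urllib.request.urlopen('https://xueqiu.com/').read().decode('utf-8')
m = re.search(r'<title>(.*?)</title>', html, re.S | re.I)   # target one specific element, here the page title
if m:
    print(m.group(1))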

I designed a simple crawler challenge site: www.heibanke.com/lesson/crawler_ex00/
If a beginner can work through the three levels on their own, I am sure they will learn something.
The solutions are covered in the course: http://study.163.com/course/courseMain.htm?courseId=1000035

4. Looking for Python code to fetch a web page

In Python 3.x, use the urllib.request module: urllib.request.urlopen opens the page and returns a stream, read() pulls the raw bytes out of it, and decode() turns those bytes into text using the page's encoding (you can find the encoding in the page source; in the example below it is gbk). The result is the page's source code.
For example, to fetch the source of this very page:

import urllib.request
url = '...'  # address of the page to fetch (the concrete URL from the original example was lost)
html = urllib.request.urlopen(url).read().decode('gbk')  # note: decode using the page's own encoding
print(html)

The documentation for urllib.request.urlopen follows:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Open the URL url, which can be either a string or a Request object.

data must be a bytes object specifying additional data to be sent to the server, or None if no such data is needed. data may also be an iterable object and in that case Content-Length value must be specified in the headers. Currently HTTP requests are the only ones that use data; the HTTP request will be a POST instead of a GET when the data parameter is provided.

data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format. It should be encoded to bytes before being used as the data parameter. The charset parameter in Content-Type header may be used to specify the encoding. If charset parameter is not sent with the Content-Type header, the server following the HTTP 1.1 recommendation may assume that the data is encoded in ISO-8859-1 encoding. It is advisable to use charset parameter with encoding used in Content-Type header with the Request.
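
A minimal POST sketch following the description above (the URL and form fields are placeholders):

import urllib.parse
import urllib.request

data = urllib.parse.urlencode({'user': 'alice', 'q': 'python'}).encode('ascii')    # must be bytes
resp = urllib.request.urlopen('http://example.com/search', data=data, timeout=10)  # data= makes this a POST
print(resp.read().decode('utf-8'))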

urllib.request module uses HTTP/1.1 and includes Connection:close header in its HTTP requests.

The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections.

If context is specified, it must be a ssl.SSLContext instance describing the various SSL options. See HTTPSConnection for more details.

The optional cafile and capath parameters specify a set of trusted CA certificates for HTTPS requests. cafile should point to a single file containing a bundle of CA certificates, whereas capath should point to a directory of hashed certificate files. More information can be found in ssl.SSLContext.load_verify_locations().
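
A hedged sketch of supplying your own trust settings through context / cafile (the bundle path and URL are placeholders):

import ssl
import urllib.request

ctx = ssl.create_default_context(cafile='/path/to/ca-bundle.crt')   # trusted CA bundle (placeholder path)
resp = urllib.request.urlopen('https://example.com/', context=ctx)
print(resp.getcode())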

The cadefault parameter is ignored.

For http and https urls, this function returns a http.client.HTTPResponse object which has the following HTTPResponse Objects methods.

For ftp, file, and data urls and requests explicitly handled by legacy URLopener and FancyURLopener classes, this function returns a urllib.response.addinfourl object which can work as context manager and has methods such as

geturl() — return the URL of the resource retrieved, commonly used to determine if a redirect was followed
info() — return the meta-information of the page, such as headers, in the form of an email.message_from_string() instance (see Quick Reference to HTTP Headers)
getcode() – return the HTTP status code of the response.
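
Putting those three methods together in a small sketch (the URL is a placeholder):

import urllib.request

resp = urllib.request.urlopen('http://example.com/')
print(resp.geturl())    # final URL, useful to see whether a redirect was followed
print(resp.getcode())   # HTTP status code, e.g. 200
print(resp.info())      # response headers as a message object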

Raises URLError on errors.

Note that None may be returned if no handler handles the request (though the default installed global OpenerDirector uses UnknownHandler to ensure this never happens).

In addition, if proxy settings are detected (for example, when a *_proxy environment variable like http_proxy is set), ProxyHandler is default installed and makes sure the requests are handled through the proxy.

The legacy urllib.urlopen function from Python 2.6 and earlier has been discontinued; urllib.request.urlopen() corresponds to the old urllib2.urlopen. Proxy handling, which was done by passing a dictionary parameter to urllib.urlopen, can be obtained by using ProxyHandler objects.
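
A minimal ProxyHandler sketch (the proxy address and target URL are placeholders):

import urllib.request

proxy = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:8080'})   # placeholder proxy
opener = urllib.request.build_opener(proxy)
urllib.request.install_opener(opener)          # subsequent urlopen() calls go through the proxy
html = urllib.request.urlopen('http://example.com/').read()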


Changed in version 3.2: cafile and capath were added.

Changed in version 3.2: HTTPS virtual hosts are now supported if possible (that is, if ssl.HAS_SNI is true).

New in version 3.2: data can be an iterable object.

Changed in version 3.3: cadefault was added.

Changed in version 3.4.3: context was added.

5. How to scrape dynamic page content with Python

Long ago, when I was learning Python web programming, I came across Python's urllib. With urllib.urlopen("url").read() you can easily read the static content of a page. But as the web has evolved, more and more pages generate their content dynamically with JavaScript, jQuery, PHP, and so on, so simply fetching the HTML with urllib is no longer enough to get what we want.
 
The approach:
The simplest idea for dealing with dynamic pages is this: urllib cannot interpret dynamic content, but a browser can, and what the browser displays is an already-rendered HTML document. That is exactly the opening we need for scraping dynamic pages. Python has a well-known GUI library, PyQt, and although it is a GUI library it ships with QtWebKit, which is very useful here. Google's Chrome and Apple's Safari are both built on the WebKit engine, so we can use QtWebKit from PyQt to load and render a page into an HTML document, then parse that document and extract the information we want.

I use Mac OS X myself; the same approach should also work on Windows and Linux.

1. Qt4 library
The Library, not Creator. On the Mac the Library's default install path should be /home/username/Developor/; do not change Qt4's default install path, or the installation may fail.
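
A minimal sketch of the QtWebKit idea described above, assuming PyQt4 with QtWebKit is installed (the class name and target URL are illustrative, not from the original answer):

import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    # load a URL in an off-screen WebKit page and keep the rendered HTML
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()                          # block until loadFinished fires

    def _finished(self, result):
        self.html = self.mainFrame().toHtml()     # HTML after JavaScript has run
        self.app.quit()

r = Render('https://xueqiu.com/')
print(r.html)

From here the rendered HTML can be parsed the same way as a static page.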

6. How to extract the main content from a page already fetched with Python

See my post:
【教程】抓取网并提取网页中所需要的信息 之 Python版
which has code and comments.

Before reading that, though, you should first look at:
【整理】关于抓取网页,分析网页内容,模拟登陆网站的逻辑/流程和注意事项
to understand the general logic of fetching a page, analyzing its content, and simulating a login, and then see:
【教程】手把手教你如何利用工具(IE9的F12)去分析模拟登陆网站(百度首页)的内部逻辑过程
to work out the internal execution logic of the particular site you need to handle.

(I am not allowed to post links here; Google the post titles above and you will find them.)

7. Python fetches a page but the content is empty; how do I handle this? Thanks

Where do these variables of yours:
table_ouzhi
tr
td
come from?

Also, if you want other people to help you solve a problem, the prerequisite is that you post the relevant and necessary code, the material involved (such as the content of the web page you are processing), and a clear statement of what you are trying to achieve. Only then can anyone help you; otherwise they could not help even if they wanted to, right?

Beyond that, this is really a question of attitude and basic principles in how one goes about things. I hope nothing like this happens again; otherwise it just lowers efficiency and wastes the energy both sides spend communicating, don't you think?

8. Scraping web page data with Python

It is probably a character-encoding problem. That page is encoded in gb2312, so decode it like this:
content.decode('gb2312')
where content is your raw page content.
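
Put together, a minimal sketch (the URL is a placeholder for a gb2312-encoded page):

import urllib.request

content = urllib.request.urlopen('http://example.com/gb2312-page').read()
text = content.decode('gb2312')    # decode using the page's declared encoding
print(text)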