在如何用vba实现网站数据采集(网抓)? 一文中我们介绍了如何在vba中实现网站数据采集。
在上文中采集的是网站的源代码,也就是html文档,但是具体到一个网抓案例,我们往往需要的是其中的具体的某类数据,比如网站中的表格数据。
这时候有多种处理方式,比如用vba内置的各种字符串处理函数split,instr等处理。
这里介绍一种利用html dom对象模型访问html文档对象模型的方法,使用通用的面向对象技术访问采集到的网站数据。
本文介绍用WinHttpRequest对象进行网站数据采集的方法。
html dom 对象模型把所有html元素都当做对象,由于网页表格的标签是table,如果要获取所有的table标签元素,可以使用getElementsByTagName(“table”),表格的行、列、单元格都属于对象。通过面向对象的技术,访问所有这些元素。代码如下:
Sub QQ1722187970() Dim sVerb As String Dim sUrl As String Dim sCharset As String Dim sPostData As String sVerb = "GET" sUrl = "http://www.77tj.org/tencent/" Dim sToken As String sResult = HtmlBasic("GET", sUrl) Set oHtmlDom = CreateObject("htmlfile") With oHtmlDom .body.innerHTML = sResult sToken = .getElementsByTagName("input")(1).Value End With For n = 1 To 10 sPostData = "PageIndex=" & n & "&__RequestVerificationToken=" & sToken sResult = HtmlBasic("POST", sUrl, , sPostData) Call HtmlTable(sResult) Next n End Sub Function HtmlBasic(ByVal sVerb As String, ByVal sUrl As String, Optional ByVal sCharset As String = "utf-8", Optional ByVal sPostData As String = "") 'sVerb为发送的Html请求的方法,sUrl为具体的网址,sCharset为网址对应的字符集编码,sPostData为Post方法对应的发送body Dim oHTML As Object 'https://msdn.microsoft.com/en-us/library/windows/desktop/aa384106(v=vs.85).aspx Set oHTML = VBA.CreateObject("WinHttp.WinHttpRequest.5.1") With oHTML Select Case sVerb Case "GET" .Open "GET", sUrl, False Case "POST" .Open "POST", sUrl, False ' .SetRequestHeader "Referer", "http://www.77tj.org/tencent" .SetRequestHeader "Content-Type", "application/x-www-form-urlencoded" End Select .send (sPostData) '获取返回的字节数组 bResult = .ResponseBody '按照指定的字符编码显示 sResult = Byte2String(bResult, sCharset) Debug.Print sResult HtmlBasic = sResult End With Set oHTML = Nothing End Function Function Byte2String(bContent, ByVal sCharset As String) Const adTypeBinary = 1 Const adTypeText = 2 Const adModeRead = 1 Const adModeWrite = 2 Const adModeReadWrite = 3 Dim oStream As Object '创建流对象 Set oStream = CreateObject("ADODB.Stream") With oStream '打开流 .Open '设置为字节模式 .Type = adTypeBinary '写入字节 .write bContent '将位置定位在第一个字节 .Position = 0 '设置为文本模式 .Type = adTypeText '设置编码的字符集 .Charset = sCharset '读取编码后的文本 Byte2String = .ReadText '关闭流对象 .Close End With End Function '提取网页表格的代码 Sub HtmlTable(ByVal sHtml As String) '网页html文档对象 Dim oHtmlDom As Object '网页表格对象 Dim oTable As Object '网页表格行对象 Dim oRows As Object '网页表格单元格对象 Dim oCells As Object '抓取的数据存放的excel表格对象 Dim oWK As Worksheet Set oWK = Sheet1 iRow = oWK.Range("a65536").End(xlUp).Row + 1 Set oHtmlDom = CreateObject("htmlfile") With oHtmlDom .body.innerHTML = sHtml Set oTable = .getElementsByTagName("table")(0) 'Set obj = getElementById("id") With oTable Set oRows = .Rows For i = 1 To oRows.Length - 1 Set oCells = oRows(i).Cells For j = 0 To oCells.Length - 1 oWK.Cells(iRow, j + 1) = oCells(j).innertext Next j iRow = iRow + 1 Next i End With End With Set oHtmlDom = Nothing Set oTable = Nothing Set oRows = Nothing Set oCells = Nothing End Sub
发表评论