A Step-by-Step Guide to Downloading All Historical Articles from a WeChat Official Account

Published: 2023-04-19 23:50:29

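I found a few official accounts with good articles and wanted to download them in bulk to read at leisure, because reading inside WeChat is never comfortable. Is there a good way to do it?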

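I used to open an account's message-history page in the WeChat PC client and then reopen it in a browser. The HTML contained the links to all historical articles, so a Python script could crawl them directly. When I tried again recently, this no longer worked: opened in the default browser, the history page only shows the message "This link must be opened in the WeChat client" and can no longer be accessed.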

Because the page is served over HTTPS, Fiddler cannot simply capture and analyze the requests the way it does for plain HTTP either.

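Capturing HTTPS content requires a self-signed root certificate. You can run Fiddler as a proxy for the PC and the phone, or use AnyProxy (an open-source proxy tool from Alibaba). AnyProxy is the more customizable of the two, so I tried AnyProxy.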

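AnyProxy installation and setup:

1. Download and install Node.js.

2. From the Node command line, run: npm install -g anyproxy

3. Generate the self-signed root certificate: anyproxy --root

4. Start AnyProxy: anyproxy -i (the -i flag makes it intercept and parse HTTPS)

5. Make sure the phone and the computer are on the same Wi-Fi network. Open http://localhost:8002/qr_root on the computer to display a QR code, scan it with the phone's browser, then download, install, and trust the certificate.

6. In the phone's Wi-Fi settings, set the proxy to the computer. All traffic from the phone then passes through the proxy running on the computer, and HTTPS sites are accessed through the self-signed certificate.

Once AnyProxy is installed and the phone is configured, all of the phone's network traffic, HTTPS requests included, is captured by the proxy. The web console is at http://localhost:8002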

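Open the account's message-history page on the phone and inspect the captured traffic. The first page comes back as HTML with a block of JSON embedded in it, holding the titles, links, and other details of the first 10 messages. Each later page is a JSON response covering 10 messages, which can be processed directly.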
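To make the first-page format concrete, here is a minimal Python 3 sketch of pulling the embedded JSON out of the HTML. The field names (msgList, list, comm_msg_info, app_msg_ext_info) are the ones used by the parsing script later in this post. The sample page, the example URL, and the helper name extract_msg_list are made up for illustration, and the real page layout may differ.

```python
import html
import json
import re

# Hypothetical miniature of a first page; real pages carry about 10 entries.
payload = {"list": [{
    "comm_msg_info": {"id": 1000000001},
    "app_msg_ext_info": {"title": "Hello",
                         "content_url": "http://mp.weixin.qq.com/s?x=1"},
}]}
# The page stores the JSON HTML-escaped (&quot; instead of ") inside a JS string.
SAMPLE_HTML = ("<script>\nvar msgList = '"
               + html.escape(json.dumps(payload)) + "';\n</script>")


def extract_msg_list(page_html):
    """Return the message list embedded in `var msgList = '...';`, or []."""
    match = re.search(r"var msgList = '(.*?)';", page_html, re.S)
    if match is None:
        return []
    decoded = html.unescape(match.group(1))  # turn &quot; back into "
    return json.loads(decoded)["list"]


messages = extract_msg_list(SAMPLE_HTML)
for m in messages:
    info = m["app_msg_ext_info"]
    print(m["comm_msg_info"]["id"], info["title"], info["content_url"])
```

Using html.unescape rather than a single replace of &quot; also decodes any other entities in the payload, which may or may not be what you want for titles.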

With the response format understood, AnyProxy can be customized:

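1. Open AnyProxy's "rule_default.js" file (Node.js).

2. Change its replaceServerResDataAsync method so that the first-page HTML is appended to firstpagehtml.txt and the later JSON responses are appended to othersjson.txt: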

replaceServerResDataAsync: function(req, res, serverResData, callback){
    var fs = require('fs');
    if(/mp\/profile_ext\?action=home/i.test(req.url)){
        // First page of the history list: save the raw HTML
        fs.appendFile('d:\\wxout\\firstpagehtml.txt', serverResData.toString(), function (err) {
            if (err) throw err;
        });
    }else if(/mp\/profile_ext\?action=getmsg/i.test(req.url)){
        // Later pages: general_msg_list is a JSON string; save one per line
        try {
            var json = JSON.parse(serverResData.toString());
            if (json.general_msg_list) {
                fs.appendFile('d:\\wxout\\othersjson.txt', json.general_msg_list + '\n', function (err) {
                    if (err) throw err;
                });
            }
        } catch (e) {
            console.log('failed to parse getmsg response', e);
        }
    }
    // Always pass the response through to the phone unchanged
    callback(serverResData);
},

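3. After the customization, restart AnyProxy, open the account's message-history page on the phone, and keep scrolling down until every message has been displayed.

4. Check the txt files to confirm that the capture succeeded.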
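One way to do that check without reading the files by eye is to count what was captured. The sketch below is Python 3 and assumes, as the rule above does, that othersjson.txt holds one JSON object per line with a "list" key. The helper name count_captured and the sample data are hypothetical.

```python
import json
import os
import tempfile


def count_captured(path):
    """Return (pages, articles) for an othersjson.txt-style file."""
    pages = articles = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            pages += 1
            for msg in json.loads(line)["list"]:
                # Only messages that carry an article have this key.
                if "app_msg_ext_info" in msg:
                    articles += 1
    return pages, articles


# Demo on a temporary file with made-up data (two pages, one article each).
sample = {"list": [
    {"comm_msg_info": {"id": 1},
     "app_msg_ext_info": {"title": "A", "content_url": "http://example.com/a"}},
    {"comm_msg_info": {"id": 2}},
]}
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(json.dumps(sample) + "\n")
    tmp.write(json.dumps(sample) + "\n")
result = count_captured(tmp.name)
os.remove(tmp.name)
print(result)
```

For a real run, call count_captured on the actual capture file and compare the page count with how far you scrolled.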

[Figure: HTML of the first page of messages]

[Figure: JSON of the later messages]
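The capture rule can also be restated in Python 3 for offline testing. The two URL patterns and the general_msg_list field come from the customized rule. The function name route_response, the sample URLs, and the sample bodies are assumptions for illustration only. Note that general_msg_list is itself a JSON string nested inside the JSON response, which is why each saved line can later be parsed on its own.

```python
import json
import re

HOME_RE = re.compile(r"mp/profile_ext\?action=home", re.I)
GETMSG_RE = re.compile(r"mp/profile_ext\?action=getmsg", re.I)


def route_response(url, body):
    """Return (file_name, text_to_append), or None if nothing is captured."""
    if HOME_RE.search(url):
        return "firstpagehtml.txt", body
    if GETMSG_RE.search(url):
        inner = json.loads(body).get("general_msg_list")
        if inner:
            return "othersjson.txt", inner + "\n"
    return None


# Hypothetical getmsg body: the inner list is serialized a second time.
inner_json = json.dumps({"list": [{"comm_msg_info": {"id": 7}}]})
body = json.dumps({"general_msg_list": inner_json})
target = route_response(
    "https://mp.weixin.qq.com/mp/profile_ext?action=getmsg", body)
print(target[0])
print(json.loads(target[1])["list"][0]["comm_msg_info"]["id"])
```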

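The main target is the article links in the captured data. Here is one of the captured historical article links:

"http://mp.weixin.qq.com/s?__biz=MzA4OTA1NDIwOQ==&amp;mid=2649025253&amp;idx=1&amp;sn=c2a84abaffdfecddb9c3b517bb12135f&amp;chksm=8830878bbf470e9d9dee06e183e3e62ece31e59a769effeef3b24ef6fcbcb2c0681271437ec9&amp;scene=27#wechat_redirect"

This is the article's permanent link. Write a Python script that processes the two files and writes the information we need (article ID, article title, and article link) to outurls.txt. The paths in the script assume the two files were copied from d:\wxout to c:\python\weixinpro: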

# Python 3
import json

OUT_PATH = r'c:\python\weixinpro\outurls.txt'


def write_articles(msg_list, fout):
    # Write one "id,title,url" line for every message that carries an article
    for x in msg_list:
        if 'app_msg_ext_info' in x:
            info = x['app_msg_ext_info']
            fout.write(str(x['comm_msg_info']['id']) + ',' + info['title'] + ',' +
                       info['content_url'].replace('\\/', '/'))
            fout.write('\n')


def getfirstpage():
    # Read the first-page HTML and parse the JSON embedded in "var msgList = '...';"
    with open(r'c:\python\weixinpro\firstpagehtml.txt', encoding='utf-8') as f, \
            open(OUT_PATH, 'a', encoding='utf-8') as fout:
        for line in f:
            if 'msgList' in line:
                newline = line.replace('var msgList = \'', '').replace('\';', '').replace('&quot;', '"').strip()
                hjson = json.loads(newline)
                write_articles(hjson['list'], fout)


def getotherpage():
    # Read the later pages: one JSON object per line
    with open(r'c:\python\weixinpro\othersjson.txt', encoding='utf-8') as f, \
            open(OUT_PATH, 'a', encoding='utf-8') as fout:
        for line in f:
            if line.strip():
                hjson = json.loads(line)
                write_articles(hjson['list'], fout)


if __name__ == "__main__":
    getfirstpage()
    getotherpage()
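The sample link shown earlier is HTML-escaped (its parameters are joined by &amp;amp; rather than &amp;), so it may need unescaping before it is fetched or compared. The Python 3 sketch below shows one way to do that and to read the identifying parameters. The parameter names come from the sample link. The helper names are hypothetical, and the URL is an abbreviated copy of the sample with the chksm parameter left out for brevity.

```python
import html
from urllib.parse import parse_qs, urlsplit

# Abbreviated copy of the sample link above (chksm omitted for brevity).
raw = ("http://mp.weixin.qq.com/s?__biz=MzA4OTA1NDIwOQ==&amp;mid=2649025253"
       "&amp;idx=1&amp;sn=c2a84abaffdfecddb9c3b517bb12135f"
       "&amp;scene=27#wechat_redirect")


def clean_article_url(escaped_url):
    """Turn &amp; back into & so the query string parses normally."""
    return html.unescape(escaped_url)


def article_key(escaped_url):
    """Return the parameters that identify the article in the link."""
    query = parse_qs(urlsplit(clean_article_url(escaped_url)).query)
    return {name: query[name][0] for name in ("__biz", "mid", "idx", "sn")}


print(clean_article_url(raw))
print(article_key(raw))
```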

[Figure: the organized article information, keeping only the article ID, title, and link]

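With this information the job is basically done. To read the articles on a computer, write a crawler script that saves each whole page locally, images included.

The page-scraping code is as follows: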

#!/usr/bin/python3
# coding: utf-8
import html
import os
import urllib.parse
import urllib.request
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    """Collect the URLs of the images, scripts and stylesheets a page uses."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img' or tag == 'script':
            for (variable, value) in attrs:
                if (variable == 'data-src' or variable == 'href') and value:
                    self.links.append(value)
        if tag == 'link':
            dic = dict(attrs)
            # .get avoids a KeyError on <link> tags without rel or href
            if dic.get('rel') == 'stylesheet' and dic.get('href'):
                self.links.append(dic['href'])


def download(pagename, html_code, durl, links, downurl):
    downfolder = downurl + pagename + '_files\\'
    if not os.path.exists(downfolder):
        os.makedirs(downfolder)
    for index, link in enumerate(links):
        localpath = downfolder + str(index)
        # Resolve relative links such as '../x.css' against the page's directory
        downlink = urllib.parse.urljoin(durl + '/', link)
        try:
            urllib.request.urlretrieve(downlink, localpath)
        except Exception as error:
            print('download error', error)
        else:
            # Point the page at the local copy; WeChat lazy-loads images via data-src
            html_code = html_code.replace(link, localpath).replace('data-src', 'src')

    with open(downurl + pagename + '.html', 'w', encoding='utf-8') as out:
        out.write(html_code)
    return True


if __name__ == "__main__":
    downurl = 'C:\\wxdown\\'
    with open(r'c:\python\weixinpro\outurls.txt', encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue
            # Line format is "id,title,url"; the title itself may contain commas
            msg_id, rest = line.strip().split(',', 1)
            pagename, url = rest.rsplit(',', 1)
            url = html.unescape(url)  # captured links use &amp; between parameters
            html_code = urllib.request.urlopen(url).read().decode('utf-8')
            hp = MyHTMLParser()
            hp.feed(html_code)
            hp.close()
            durl = url.rsplit('/', 1)[0]
            download(pagename, html_code, durl, hp.links, downurl)
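The scraper uses the article title as the file and folder name, and titles can contain characters that Windows does not allow in paths. A small Python 3 helper along these lines could be applied to the title before it is used. The name safe_filename and the length limit are assumptions, not part of the original script.

```python
import re


def safe_filename(title, max_length=100):
    """Replace characters that are invalid in Windows file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "_", title).strip(" .")
    # Fall back to a fixed name if nothing usable is left.
    return cleaned[:max_length] or "untitled"


print(safe_filename('What? A/B: "test"'))
print(safe_filename("手把手教你下载微信公众号所有历史文章"))
```

Non-ASCII titles are left untouched, so Chinese titles stay readable on disk.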

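To read the articles on a phone, you can write a Python script modeled on wiz2url that pushes them to WizNote, Evernote, or a similar note-taking app, and then save, organize, and read them on the phone.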

PS: Since the article IDs are already captured, incremental updates are left for you to work out.
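One possible approach to incremental updates, sketched in Python 3: collect the IDs already present in outurls.txt and skip any captured message whose ID has been seen. It assumes the "id,title,url" line format used by the parsing script above. The function names and sample data are hypothetical.

```python
def load_seen_ids(lines):
    """Collect the IDs (first comma-separated field) of existing lines."""
    seen = set()
    for line in lines:
        line = line.strip()
        if line:
            seen.add(line.split(",", 1)[0])
    return seen


def new_messages(messages, seen_ids):
    """Keep only the messages whose ID has not been written before."""
    return [m for m in messages
            if str(m["comm_msg_info"]["id"]) not in seen_ids]


# Made-up data: one article already saved, one newly captured.
existing_lines = ["1000000001,Old article,http://example.com/old\n"]
captured = [{"comm_msg_info": {"id": 1000000001}},
            {"comm_msg_info": {"id": 1000000002}}]
fresh = new_messages(captured, load_seen_ids(existing_lines))
print([m["comm_msg_info"]["id"] for m in fresh])
```

In practice the lines would come from opening outurls.txt, and only the fresh messages would be appended and downloaded.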
