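A friend recently wanted to export the bill records from WeChat. It turned out that WeChat's anti-scraping measures are quite strong, so I spent some time analyzing them.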
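I. Scraping with conventional simulated HTTP requests
The main URL to fetch is https://wx.tenpay.com/userroll/userrolllist, followed by three query parameters (see the code for the details). Of these, exportkey expires. userroll_encryption and userroll_pass_ticket have to be taken from the cookie, and presumably identify whose data is being requested. Packet captures give no hint of how they are produced, so they are probably generated inside the WeChat app. If you log in with the WeChat developer tools and open the URL directly, it sometimes returns data, but only for a short while. Once a session-timeout response comes back, further access from the web page is blocked and keeps reporting a session timeout. My guess is that exportkey has different lifetime and request-count limits on the web and on mobile.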
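As a rough sketch of how such a request is put together, the snippet below builds the query string and the Cookie header. The parameter names count, exportkey and sort_type are the ones that appear in the code in section III. The helper names build_url and build_cookie and all the values are placeholders made up for illustration, and the sketch assumes exportkey is passed in its raw (not percent-encoded) form.

```python
from urllib.parse import urlencode

BASE_URL = "https://wx.tenpay.com/userroll/userrolllist"


def build_url(exportkey, count=20, sort_type=1):
    """Build the bill-list URL; exportkey is given in raw (decoded) form."""
    query = urlencode({"count": count, "exportkey": exportkey, "sort_type": sort_type})
    return BASE_URL + "?" + query


def build_cookie(userroll_encryption, userroll_pass_ticket):
    """Join the two cookie values into a single Cookie header string."""
    return ("userroll_encryption=" + userroll_encryption
            + "; userroll_pass_ticket=" + userroll_pass_ticket)


# Placeholder values, not real credentials
url = build_url("PLACEHOLDER+KEY=")
cookie = build_cookie("ENC_PLACEHOLDER", "TICKET_PLACEHOLDER")
print(url)
print(cookie)
```

urlencode percent-encodes the characters "+" and "=" in the key, which matches the encoded form that shows up in a captured URL.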
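Next I considered cracking the session. After looking into it, I concluded that this is not feasible: it would mean defeating wx.login, which is provided by WeChat, and I don't need to tell you how hard that would be.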
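II. Getting the exportkey and the Cookie
Tools needed:
1. An Android or Apple phone
2. Fiddler (a packet-capture tool)
Anyone who has written crawlers knows Fiddler, so I won't go through the steps. Set up the proxy on the phone, start Fiddler, then capture the exportkey from the request URL and the corresponding Cookie. Both are used for the scraping below.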
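To illustrate the last step, here is a small sketch that pulls exportkey out of a captured request URL and the two cookie values out of a captured Cookie header. The sample URL and header are placeholders rather than real captures, and the function names are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs


def extract_exportkey(captured_url):
    """Return the decoded exportkey from a captured URL, or None if absent."""
    params = parse_qs(urlsplit(captured_url).query)
    values = params.get("exportkey")
    return values[0] if values else None


def extract_cookies(cookie_header,
                    wanted=("userroll_encryption", "userroll_pass_ticket")):
    """Return the wanted cookie values from a raw Cookie header string."""
    found = {}
    for part in cookie_header.split(";"):
        # Split on the first "=" only, because the values may end in "=" padding
        name, sep, value = part.strip().partition("=")
        if sep and name in wanted:
            found[name] = value
    return found


# Placeholder inputs standing in for what Fiddler would show
sample_url = ("https://wx.tenpay.com/userroll/userrolllist"
              "?count=20&exportkey=AB%2BCD%3D&sort_type=1")
sample_cookie = "foo=bar; userroll_encryption=ENC+VALUE==; userroll_pass_ticket=TICKET/VALUE"

print(extract_exportkey(sample_url))
print(extract_cookies(sample_cookie))
```

Note that parse_qs returns the decoded value. If the key is spliced into a URL by hand, it needs to be percent-encoded again, for example with urllib.parse.quote.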
III. The code
The code isn't great. If you spot any mistakes, corrections are welcome.
# coding:utf-8
import datetime
import io
import json
import ssl
import sys
import time
import urllib.request

from DBController import DBController  # the author's own database helper (not shown)

# Set the console output encoding
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')
# Work around untrusted SSL certificates on HTTPS requests
# (note that this turns certificate verification off)
ssl._create_default_https_context = ssl._create_unverified_context


class MainCode:
    def __init__(self, url=""):
        self.url = url
        self.dbController = DBController()  # database access
        self.userroll_encryption = "uoxQXsCenowxj0G0ppRKBg8iHRPZwZKaUZB0ka1Y5apUuQnKkZTsA/2RMhBPGyMdiHS8QXk8y2JeLgqTPqZPU9fkrCUp+TIQPkHH/uExAwKeBFLute0ztdHaC6GJUJ2+/R8NGWGe16hSKc6L1+LvAw=="
        self.userroll_pass_ticket = "V7oum4glDbdaAwibC8mcuTizGIKmC9A/Y/V12qASuDALdRMveHcRHv1QXamFk27Z"
        self.last_item = {}
        self.num = 0

    # Fetch the page and return the parsed JSON (an empty dict on failure)
    def get_html(self, url, maxTryNum=5):
        goon = True  # set to False once every attempt has failed
        obj = {}
        for tryNum in range(maxTryNum):
            try:
                # Accept-Encoding is deliberately not sent: urllib does not
                # decompress gzip/br bodies, so decoding them as UTF-8 would fail
                header = {
                    "Accept": 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
                    "Accept-Language": 'zh-CN,zh;q=0.8',
                    "Cache-Control": 'max-age=0',
                    "Connection": "keep-alive",
                    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 10_2 like Mac OS X) AppleWebKit/602.3.12 (KHTML, like Gecko) Mobile/14C92 Safari/601.1 wechatdevtools/1.02.1810240 MicroMessenger/6.5.7 Language/zh_CN webview/15415760070117398 webdebugger port/32594",
                    "Cookie": "userroll_encryption=" + self.userroll_encryption + "; userroll_pass_ticket=" + self.userroll_pass_ticket,
                    "Host": "wx.tenpay.com",
                    "Upgrade-Insecure-Requests": "1",
                }
                req = urllib.request.Request(url=url, headers=header)
                # Send the request
                result = urllib.request.urlopen(req, timeout=5).read()
                break
            except OSError as e:  # covers HTTPError, URLError and socket timeouts
                if tryNum < (maxTryNum - 1):
                    print("Retrying request " + str(tryNum + 1))
                    time.sleep(5)
                else:
                    print('Internet Connect Error!', "Error URL:" + url)
                    goon = False
                    break
        if goon:
            page = result.decode('utf-8')
            obj = json.loads(page)
        else:
            print("--------------------------")
        return obj

    # Save one record to the database, skipping trans_ids that are already stored.
    # The values are interpolated straight into the SQL text, so a quote inside
    # any of them would break the statement.
    def save_info_to_db(self, item):
        select_sql = "SELECT count(*)as num FROM wx_order2 where trans_id = '%s'" % (item["trans_id"])
        results = self.dbController.ExecuteSQL_Select(select_sql)
        if int(results[0][0]) == 0:
            sql = "INSERT INTO wx_order2 (bill_id, bill_type, classify_type, fee, fee_type, out_trade_no, pay_bank_name, payer_remark, remark, order_time, title, total_refund_fee, trans_id,fee_attr) VALUES ( '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s','%s','%s')" % (
                str(item['bill_id']), str(item['bill_type']), str(item['classify_type']),
                str(item['fee']), str(item['fee_type']), str(item['out_trade_no']),
                str(item['pay_bank_name']), str(item['payer_remark']), str(item['remark']),
                str(item['order_time']), str(item['title']), str(item['total_refund_fee']),
                str(item['trans_id']), str(item['fee_attr'])
            )
            try:
                self.dbController.ExecuteSQL_Insert(sql)
            except Exception as e:
                print("save_info_to_db:", e)
        return

    # Pick the fields we need out of the response; returns the number of records on this page
    def get_data(self, url):
        res_obj = self.get_html(url)
        this_page_num = 0
        # ret_code == 0 means the data was fetched successfully
        # (.get avoids a KeyError when get_html returned an empty dict)
        if res_obj.get('ret_code') == 0:
            record_list = res_obj['record']
            self.last_bill_id = res_obj['last_bill_id']
            self.last_bill_type = res_obj['last_bill_type']
            self.last_create_time = res_obj['last_create_time']
            self.last_trans_id = res_obj['last_trans_id']
            this_page_num = len(record_list)
            for order in record_list:
                bill_id = order['bill_id']
                bill_type = order['bill_type']
                classify_type = order['classify_type']
                fee = round(order['fee'] * 0.01, 2)  # bill amount, scaled by 0.01 and rounded to two decimals
                fee_type = order['fee_type']  # currency type
                out_trade_no = order['out_trade_no']  # bill number
                pay_bank_name = order['pay_bank_name']  # paying bank
                payer_remark = order['payer_remark']  # payer's note
                remark = order['remark']  # bill note
                order_time = datetime.datetime.fromtimestamp(order['timestamp'])  # timestamp to datetime
                title = order['title']  # bill title
                title = title.replace(',', '').replace('.', '').replace("'", '')  # strip ASCII commas, periods and single quotes
                total_refund_fee = "0"
                trans_id = order['trans_id']
                fee_attr = order['fee_attr']
                # pay_type is only used by the commented-out debug print below
                if bill_type == 1:
                    pay_type = "payment"
                elif bill_type == 2:
                    pay_type = "top-up"
                elif bill_type == 4:
                    pay_type = "transfer"
                elif bill_type == 6:
                    pay_type = "red packet"
                else:
                    pay_type = str(bill_type)
                if fee_attr == "positive":
                    fee_attr = "income"
                elif fee_attr == "negtive":  # spelling kept exactly as in the original listing
                    fee_attr = "expense"
                elif fee_attr == "neutral":
                    fee_attr = "withdrawal"
                item = {
                    'bill_id': bill_id,
                    'bill_type': bill_type,
                    'classify_type': classify_type,
                    'fee': fee,
                    'fee_type': fee_type,
                    'out_trade_no': out_trade_no,
                    'pay_bank_name': pay_bank_name,
                    'payer_remark': payer_remark,
                    'remark': remark,
                    'order_time': order_time,
                    'title': title,
                    'total_refund_fee': total_refund_fee,
                    'trans_id': trans_id,
                    'fee_attr': fee_attr,
                }
                # Remember the last record; its fields are needed to request the next page
                if bill_id != '':
                    self.last_item['last_bill_id'] = bill_id
                    self.last_item['last_bill_type'] = bill_type
                    self.last_item['last_create_time'] = order['timestamp']
                    self.last_item['last_trans_id'] = trans_id
                try:
                    print(str(self.num), self.last_item, end='\n')
                    self.num += 1
                    time.sleep(0.2)
                    self.save_info_to_db(item)
                    # print(str(self.num) + " time:" + str(order_time) + " title:" + title + " note:" + str(remark) + " " + str(pay_type) + " amount:" + str(fee) + " paid via:" + str(pay_bank_name) + " fee_attr:" + str(fee_attr))
                except Exception as e:
                    print(e, end='\n')
        else:
            # The request did not succeed: print the response to show why
            print(res_obj)
        return this_page_num


# Instantiate
maincode = MainCode()
# Set the Cookie values
maincode.userroll_encryption = "6Ow68aKrAz70mEczqeevA2gOXbr9H2a7+2ite6uuyWFdB6j1+SLhlaCNpYA6RjmaOI7IfCi9PXjQsrZPFIs1SMn38Uxr04GJsxMuSO/9wG+eBFLute0ztdHaC6GJUJ2+vmo+JIw351su8RiFxSagwA=="
maincode.userroll_pass_ticket = "i0Co+55KSEjmFjfFZqMG14hasW4qtKFtbj0FiErcSzHY0afkFqHGib3YfsAZWcaG"
# Uncomment to start from a page other than the first
#maincode.last_item['last_bill_id'] = "2ce3d65b20a10700b2048d68"
#maincode.last_item['last_bill_type'] = "4"
#maincode.last_item['last_create_time'] = "1540809516"
#maincode.last_item['last_trans_id'] = "1000050201201810290100731805325"
# Number of records returned per request
count = "20"
# exportkey has to be captured with Fiddler and is only valid for a limited time
exportkey = "A%2BsIJaTGZksgZWPLtSKiyos%3D"
# URL to fetch. The "?count=" prefix is an assumption: the original listing
# appended the value of count directly after "userrolllist".
base_url = "https://wx.tenpay.com/userroll/userrolllist?count=" + count + "&exportkey=" + exportkey + "&sort_type=1"
for page in range(0, 10):
    # Number of records returned for the current page
    this_page_num = 0
    if page == 0:
        # First page
        this_page_num = maincode.get_data(base_url)
    else:
        # From the second page on, pass some fields of the previous page's
        # last item in order to fetch the next page
        url = (base_url
               + "&last_bill_id=" + str(maincode.last_item['last_bill_id'])
               + "&last_bill_type=" + str(maincode.last_item['last_bill_type'])
               + "&last_create_time=" + str(maincode.last_item['last_create_time'])
               + "&last_trans_id=" + str(maincode.last_item['last_trans_id'])
               + "&start_time=" + str(maincode.last_item['last_create_time']))
        print(url)
        this_page_num = maincode.get_data(url)
    # Fewer records than requested means this was the last page
    if this_page_num < int(count):
        break
    time.sleep(0.5)
print(maincode.last_item)
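The listing builds its SQL by string formatting, which is presumably why single quotes are stripped from the title before saving. Since the DBController module is not shown, the following is only a hedged alternative sketch using the standard-library sqlite3 module with "?" placeholders. The table is cut down to four columns and save_item is a hypothetical helper, not part of the author's DBController.

```python
import sqlite3


def save_item(conn, item):
    """Insert one bill record unless its trans_id is already stored.

    Returns True if a row was inserted, False if it was a duplicate.
    """
    row = conn.execute(
        "SELECT COUNT(*) FROM wx_order2 WHERE trans_id = ?",
        (item["trans_id"],),
    ).fetchone()
    if row[0] > 0:
        return False
    # Placeholders let the driver handle quoting, so titles need no cleanup
    conn.execute(
        "INSERT INTO wx_order2 (trans_id, title, fee, fee_attr) VALUES (?, ?, ?, ?)",
        (item["trans_id"], item["title"], item["fee"], item["fee_attr"]),
    )
    conn.commit()
    return True


# In-memory database with a simplified, made-up schema
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wx_order2 (trans_id TEXT PRIMARY KEY, title TEXT, fee REAL, fee_attr TEXT)"
)

item = {"trans_id": "T1", "title": "it's a test, ok", "fee": 12.34, "fee_attr": "expense"}
first = save_item(conn, item)   # new record
second = save_item(conn, item)  # same trans_id again
print(first, second)
```

With placeholders, a title containing quotes or commas is stored unchanged, and the duplicate check on trans_id works the same way as in the listing.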
Since I was scraping this for a friend, getting it to work was enough. I'll keep optimizing the code later if the need arises.