Regular Expressions, the re Library, and a Stock Crawler Case Study

2019-11-03


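Regular expressions

regular expression (regex, RE)

For example, the set of strings 'PY', 'PYY', 'PYYY', ... can be written as the single pattern PY+

A general framework for expressing strings
An expression that concisely describes a set of strings
A tool built on two ideas about strings: concise description and shared features
Decides whether a given string belongs to a feature set
Expresses the features of a type of text
Finds or replaces a group of strings at once
Matches all or part of a string
Compilation: converting a string that follows regular expression syntax into a regular expression object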

import re

regex = 'P(Y|YT|YTH|YTHO)?N'

# Compile the pattern string into a regular expression object
p = re.compile(regex)
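
To see which strings this pattern accepts, the sketch below checks a few candidates with fullmatch. The candidate list is made up for illustration.

```python
import re

# Pattern from the text: 'P', then optionally one of Y / YT / YTH / YTHO, then 'N'
pattern = re.compile(r'P(Y|YT|YTH|YTHO)?N')

# Hypothetical test strings
candidates = ['PN', 'PYN', 'PYTN', 'PYTHN', 'PYTHON', 'PYTHONN', 'PTN']

# fullmatch succeeds only when the whole string matches the pattern
accepted = [s for s in candidates if pattern.fullmatch(s)]
print(accepted)
```

Only the first five candidates are accepted: 'PYTHONN' has an extra trailing N, and 'PTN' skips the Y that every optional branch starts with.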

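Taobao example

# Note: the regex logic in this example is problematic
import re

import requests


def getHTMLText(url):
    """Fetch a page and return its text, or "" if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def parsePage(ilt, html):
    """Append [price, title] pairs found in the page source to ilt."""
    # Capture groups return the quoted values directly, so eval() is not needed
    plt = re.findall(r'"view_price":"([\d.]*)"', html)
    tlt = re.findall(r'"raw_title":"(.*?)"', html)
    # zip stops at the shorter list, so a mismatch cannot raise IndexError
    for price, title in zip(plt, tlt):
        ilt.append([price, title])


def printGoodsList(ilt):
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Title"))
    for count, g in enumerate(ilt, start=1):
        print(tplt.format(count, g[0], g[1]))


def main():
    goods = '书包'  # search keyword ("schoolbag")
    depth = 2      # number of result pages to fetch
    # The keyword is appended without a parameter name, as in the original
    start_url = 'https://uland.taobao.com/sem/tbsearch?' + goods
    infoList = []
    for i in range(depth):
        try:
            # s is the offset of the first item on the page (44 items per page)
            url = start_url + '&s=' + str(44 * i)
            html = getHTMLText(url)
            parsePage(infoList, html)
        except Exception:
            continue
    printGoodsList(infoList)


main()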

Regular expression syntax

A regular expression is made up of characters and operators.
Common operators
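
A minimal sketch of the operators that appear most often in Python re patterns. The sample strings are made up for illustration, and each pattern is checked against its sample with re.fullmatch.

```python
import re

# Each entry: (pattern, sample string, what the operator does)
examples = [
    (r'P.N',       'PYN',   '.      any single character'),
    (r'[abc]',     'b',     '[]     one character from the set'),
    (r'[^abc]',    'd',     '[^]    one character not in the set'),
    (r'abc*',      'abccc', '*      previous character, 0 or more times'),
    (r'abc+',      'abcc',  '+      previous character, 1 or more times'),
    (r'abc?',      'ab',    '?      previous character, 0 or 1 times'),
    (r'abc|def',   'def',   '|      either the left or the right expression'),
    (r'ab{2}c',    'abbc',  '{m}    previous character exactly m times'),
    (r'ab{1,2}c',  'abc',   '{m,n}  previous character m to n times'),
    (r'^abc',      'abc',   '^      start of the string'),
    (r'abc$',      'abc',   '$      end of the string'),
    (r'(abc|def)', 'abc',   '()     grouping'),
    (r'\d',        '7',     r'\d     a digit'),
    (r'\w',        '_',     r'\w     a word character'),
]

results = [bool(re.fullmatch(p, s)) for p, s, _ in examples]
for (p, s, note), ok in zip(examples, results):
    print(f'{p!r:14} {s!r:9} {ok}  {note}')
```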

Syntax examples

A regular expression for an IP address in string form (an IP address has four segments, each in the range 0-255)
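
One way to build such a pattern is to write a sub-pattern for a single 0-255 segment and repeat it. This is a sketch, not the only possible answer; the names segment, ip_pattern and is_ip are illustrative and do not come from the original.

```python
import re

# One segment in the range 0-255, split into four cases:
#   [1-9]?\d  -> 0-99
#   1\d{2}    -> 100-199
#   2[0-4]\d  -> 200-249
#   25[0-5]   -> 250-255
# (?:...) groups the alternatives without capturing them
segment = r'(?:[1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])'

# Three "segment + dot" pairs followed by a final segment
ip_pattern = re.compile(rf'(?:{segment}\.){{3}}{segment}')


def is_ip(text):
    """Return True if text is a dotted IPv4 address with each segment in 0-255."""
    return ip_pattern.fullmatch(text) is not None


print(is_ip('192.168.0.1'))
print(is_ip('256.1.1.1'))
```

The dot is escaped as \. because an unescaped . would match any character.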

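Code examples

import re

# search: scan the whole string and return the first match
match = re.search(r'[1-9]\d{5}', 'BIT 100081')
if match:
    print(match.group(0))

# match: only matches at the start of the string, so this returns None.
# Calling match.group(0) on None would raise AttributeError.
match = re.match(r'[1-9]\d{5}', 'BIT 100081')
print(match)
# With the postal code at the start of the string, match succeeds
match = re.match(r'[1-9]\d{5}', '100081 BIT')
if match:
    print(match.group(0))

# findall: return every match as a list of strings
ls = re.findall(r'[1-9]\d{5}', 'BIT100081 TSU100084')
print(ls)

# split: cut the string at every match; maxsplit limits the number of cuts
print(re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084'))
print(re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084', maxsplit=1))

# finditer: iterate over the matches, one match object at a time
for m in re.finditer(r'[1-9]\d{5}', 'BIT100081 TSU100084'):
    print(m.group(0))

# sub: replace the matches; count limits the number of replacements
print(re.sub(r'[1-9]\d{5}', ':zipcode', 'BIT100081 TSU100084'))
a = re.sub(r'[1-9]\d{5}', ':zipcode', 'BIT100081 TSU100084', count=1)
print(a)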

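Match object example

import re
m = re.search(r'[1-9]\d{5}', 'BIT100081 TSU100084')
print(type(m))     # the match object type
print(m.string)    # the text that was searched
print(m.re)        # the compiled pattern that produced the match
print(m.pos)       # index where the search started
print(m.endpos)    # index where the search stopped
print(m.group(0))  # the matched substring
print(m.start())   # start index of the match in the text
print(m.end())     # end index of the match in the text
print(m.span())    # (start, end) as a tuple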

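Basic use of the re library

import re

Regular expressions are normally written as raw strings, i.e. r'text', so that backslashes do not have to be escaped a second time

r'[1-9]\d{5}' # postal codes in mainland China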
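
A short sketch of why the raw string form is preferred. The variable names are illustrative.

```python
import re

raw = r'[1-9]\d{5}'       # raw string: the backslash is kept exactly as written
escaped = '[1-9]\\d{5}'   # ordinary string: the backslash has to be doubled

# Both spellings produce the same pattern text
same = (raw == escaped)
print(same)

# In a raw string, \n is two characters (backslash and n), not a newline
lengths = (len(r'\n'), len('\n'))
print(lengths)

print(re.search(raw, 'BIT 100081').group(0))
```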

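Main functions of the re library

re.search(pattern,string,flags=0) searches the string for the first position that matches and returns a match object (or None if nothing matches)

pattern: the regular expression, as a string or raw string

string: the string to be matched

flags: control flags applied when the regular expression is used; three are common

re.I re.IGNORECASE ignore case when matching
re.M re.MULTILINE the ^ operator matches at the start of every line of the given string
re.S re.DOTALL the . operator matches every character, including newline (by default it matches everything except newline)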
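
The effect of the three flags can be seen on a small two-line string. A sketch; the sample text and variable names are made up for illustration.

```python
import re

text = 'Python\npython'

# re.I: letters match regardless of case
case_sensitive = re.findall(r'python', text)
ignore_case = re.findall(r'python', text, flags=re.I)

# re.M: ^ matches at the start of every line, not only at the start of the string
first_line_only = re.findall(r'^p\w+', text, flags=re.I)
every_line = re.findall(r'^p\w+', text, flags=re.I | re.M)

# re.S: . also matches the newline character
without_dotall = re.search(r'Python.python', text)
with_dotall = re.search(r'Python.python', text, flags=re.S)

print(case_sensitive, ignore_case)
print(first_line_only, every_line)
print(without_dotall is None, with_dotall is not None)
```

Flags are combined with the | operator, as in re.I | re.M above.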

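re.match(pattern,string,flags=0) matches only from the start of the string; returns a match object (or None)
re.findall(pattern,string,flags=0) returns a list
re.split(pattern,string,maxsplit=0,flags=0) returns a list

maxsplit: the maximum number of splits; the remainder is returned as the last element

re.finditer(pattern,string,flags=0) returns an iterator over the matches; each item is a match object
re.sub(pattern,repl,string,count=0,flags=0) replaces every substring that matches the regular expression and returns the resulting string

repl: the replacement string; count: the maximum number of replacements (0 means replace all)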

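An equivalent way to use the re library

rst = re.search(r'[1-9]\d{5}','bit 100081') # functional usage: a one-off operation

pat = re.compile(r'[1-9]\d{5}')

rst = pat.search('bit 100081') # object-oriented usage: compile once, then reuse

re.compile(pattern,flags=0) compiles a regular expression string into a regular expression object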
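
The compiled form pays off when the same pattern is applied to many strings. A sketch, with made-up input lines and illustrative names:

```python
import re

# Compile once ...
zip_pat = re.compile(r'[1-9]\d{5}')

# ... then reuse the same object on several (hypothetical) inputs
lines = ['BIT 100081', 'TSU 100084', 'no code here']

codes = []
for line in lines:
    m = zip_pat.search(line)
    if m:                      # search returns None when nothing matches
        codes.append(m.group(0))
print(codes)

# The pattern object offers the same operations as the module-level functions
all_codes = zip_pat.findall('BIT100081 TSU100084')
masked = zip_pat.sub(':zipcode', 'BIT100081 TSU100084')
print(all_codes, masked)
```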

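The match object of the re library

Attributes of the match object

Attribute	Description
.string	the text that was searched
.re	the pattern object (regular expression) used for the match
.pos	the position in the text where the search started
.endpos	the position in the text where the search ended

Common methods of the match object

Method	Description
.group(0)	the matched string
.start()	start position of the match in the original string
.end()	end position of the match in the original string
.span()	returns (.start(), .end())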
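
re.search returns only the first match, so collecting the position of every match takes re.finditer. A minimal sketch using the sample string from earlier in the post; the variable names are illustrative.

```python
import re

text = 'BIT100081 TSU100084'

# Collect the matched text and its (start, end) span for every match
spans = [(m.group(0), m.span()) for m in re.finditer(r'[1-9]\d{5}', text)]
print(spans)

# start() and end() are ordinary slice indices into the searched text
first = re.search(r'[1-9]\d{5}', text)
sliced = first.string[first.start():first.end()]
print(sliced == first.group(0))
```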

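Greedy and minimal matching in the re library

By default the re library matches greedily: it returns the longest string that matches
Minimal-match operators

Operator	Description
*?	0 or more repetitions of the previous character, minimal match
+?	1 or more repetitions of the previous character, minimal match
??	0 or 1 repetitions of the previous character, minimal match
{m,n}?	m to n repetitions (n included) of the previous character, minimal match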

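Code

import re

# Greedy: .* takes as much as possible, so the match runs to the last N
m = re.search(r'PY.*N', 'PYANBNCNDN')
print(m.group(0))

# Minimal: .*? takes as little as possible, so the match stops at the first N
match = re.search(r'PY.*?N', 'PYANBNCNDN')
# print(match)
print(match.group(0))


# Note: the pattern can also be compiled into a regex object first
# regex = re.compile(pattern)
# regex.search(string)
# regex.match(string)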
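
The other minimal-match operators in the table behave the same way. A sketch comparing each greedy operator with its minimal counterpart; the sample string and the helper name first are made up for illustration.

```python
import re

text = 'PYYYYN'


def first(pattern):
    """Return the text of the first match of pattern in the sample string."""
    return re.search(pattern, text).group(0)


plus_greedy, plus_minimal = first(r'PY+'), first(r'PY+?')
opt_greedy, opt_minimal = first(r'PY?'), first(r'PY??')
range_greedy, range_minimal = first(r'PY{1,3}'), first(r'PY{1,3}?')

print(plus_greedy, plus_minimal)     # + takes every Y, +? takes one
print(opt_greedy, opt_minimal)       # ? takes one Y, ?? takes none
print(range_greedy, range_minimal)   # {1,3} takes three Ys, {1,3}? takes one
```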

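A targeted crawler for stock data

Goal: get the names and trading information of all stocks on the Shanghai and Shenzhen stock exchanges
Output: saved to a file
Technical approach: requests, bs4, re
Candidate sites: Sina Stock, Baidu Stock
- Selection criteria: the stock information is present statically in the HTML page, is not generated by JavaScript, and is not restricted by the robots protocol
- How to check: view the page source
- Mindset: do not get stuck on one site; try several
Program structure:
1. Get the stock list from East Money (东方财富网)
2. For each stock in the list, get its details from Baidu Stock
3. Store the results in a file
Programming steps
1. Design the functions
2. Write the main function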
Code

import re
import traceback

import requests
from bs4 import BeautifulSoup


def getHTMLText(url, code='utf-8'):
    """Fetch a page and decode it with the given encoding; return "" on failure."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except requests.RequestException:
        return ""


def getStockList(lst, stockURL):
    """Collect stock codes such as sh600000 from the links on the list page."""
    html = getHTMLText(stockURL, 'GB2312')
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a'):
        try:
            href = a.attrs['href']
            lst.append(re.findall(r"s[hz]\d{6}", href)[0])
        except (KeyError, IndexError):
            # The link has no href, or the href contains no stock code
            continue


def getStockInfo(lst, stockURL, fpath):
    """Fetch each stock's page and append its fields to fpath, one dict per line."""
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})
            name = stockInfo.find(attrs={'class': 'bets-name'})
            infoDict.update({'股票名称': name.text.split()[0]})  # key: "stock name"

            # Each <dt> holds a field name and the matching <dd> holds its value
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
            count = count + 1
            print('\rProgress: {:.2f}%'.format(count * 100 / len(lst)), end='')
        except Exception:
            count = count + 1
            print('\rProgress: {:.2f}%'.format(count * 100 / len(lst)), end='')
            traceback.print_exc()
            continue


def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    # getStockInfo appends "<code>.html" to this value, so it is meant to be
    # the base URL of the per-stock pages; the value is kept from the original
    stock_info_url = 'http://quote.eastmoney.com/us/BIDU.html'
    output_file = 'D://Python//code//images//a.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)


main()
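
The link-filtering step in getStockList and the progress line in getStockInfo can be tried without network access. A sketch under stated assumptions: the href values below are hypothetical stand-ins for links on the list page, and the names code_pat, codes and progress are illustrative.

```python
import re

# Hypothetical href values, standing in for the <a> links on the list page
hrefs = [
    'http://quote.eastmoney.com/sh600000.html',
    'http://quote.eastmoney.com/sz000001.html',
    'http://quote.eastmoney.com/center/',
]

# 's', then 'h' or 'z', then exactly six digits
code_pat = re.compile(r's[hz]\d{6}')

codes = []
for href in hrefs:
    found = code_pat.findall(href)
    if found:                  # skip links that contain no stock code
        codes.append(found[0])
print(codes)

# Progress text as built in getStockInfo: percentage with two decimals
progress = '{:.2f}%'.format(1 * 100 / 3)
print(progress)
```

The colon in {:.2f} is required; without it, format() treats .2f as an attribute lookup instead of a format specification and raises an error.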