An analyst who doesn't write R packages isn't a good full-stack developer

Python Regular Expressions II











Loading the required libraries







In [4]:



from nltk.book import *
import nltk
import re









*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908









  • Read in the test text:








In [24]:



# readlines(5) treats 5 as a size hint, not a line count, so read all lines
# and keep the first four explicitly.
with open("Alcie.txt", "r") as f:
    lines = f.readlines()
raw = "".join(lines[:4])
print(raw)









Alice was beginning to get very tired of sitting by her sister on the bank,
and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations
in it, "and what is the use of a book,"thought Alice "without pictures or conversation?'








\s and \w: splitting text

Let's try splitting this text in three ways:


  • Split on r' '

  • Split on r'[ \t\n]+'

  • Split on the regular expression r'\s+'





re.split(r' ',raw)
re.split(r'[ \t\n]+',raw)
re.split(r'\s+',raw)





The results are long, so they are not shown here.



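  • The first pattern is flawed: splitting on a single space ignores tabs (\t) and newlines (\n), so they stay inside the pieces.

  • The second pattern matches one or more spaces, tabs, or newlines, so it splits the text sensibly.

  • The third pattern is the most concise. \s is the re module's built-in shorthand for any whitespace character (space, tab, and newline, plus \r, \f, and \v), so on this text it gives the same result as the second.

  • It is worth noting that \S is the complement of \s, that is, any non-whitespace character.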
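Since the full results are too long to show, the difference between the three patterns can be sketched on a short string instead. The `sample` string below is made up for illustration and is not part of the Alice text.

```python
import re

# Hypothetical sample containing a double space, a tab, and a newline.
sample = "one  two\tthree\nfour"

by_space = re.split(r' ', sample)         # splits on each single space only
by_class = re.split(r'[ \t\n]+', sample)  # one or more spaces, tabs, or newlines
by_ws = re.split(r'\s+', sample)          # one or more whitespace characters
non_ws = re.findall(r'\S+', sample)       # runs of non-whitespace characters

print(by_space)
print(by_class)
print(by_ws)
print(non_ws)
```

The first call leaves the tab and newline inside the last piece and produces an empty string for the double space; the other three all return the four words.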


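Next, let's split using \w.

\w matches any letter, digit, or underscore:


r'\w' = r'[a-zA-Z0-9_]'

(This equivalence holds for ASCII text. With Python 3 str patterns, \w also matches non-ASCII letters and digits unless the re.ASCII flag is set.)


Likewise, \W is the complement of \w: any character that is not a letter, digit, or underscore.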
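How far \w agrees with the explicit character class depends on the input. The sketch below assumes Python 3 and uses a made-up word containing one non-ASCII letter to compare the two.

```python
import re

# Hypothetical sample: "café_2024", written with an escape for the accented letter.
word = "caf\u00e9_2024"

unicode_w = re.findall(r'\w+', word)              # default str behaviour in Python 3
ascii_class = re.findall(r'[a-zA-Z0-9_]+', word)  # explicit ASCII character class
ascii_w = re.findall(r'\w+', word, re.ASCII)      # \w restricted to ASCII

print(unicode_w, ascii_class, ascii_w)
```

By default \w keeps the accented letter inside the token, while the explicit class and the re.ASCII variant both break the word at that letter.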








In [30]:



re.split(r'\W+',raw)[0:9]







Out[30]:

['Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting']




In [32]:



re.findall(r'\w+',raw)[0:9]







Out[32]:

['Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting']
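The two calls agree on the first nine tokens here, but they are not interchangeable in general. When a string starts or ends with non-word characters, re.split leaves empty strings at the edges, while re.findall does not. A small sketch with a made-up string:

```python
import re

sample = '"Hello, world!"'  # hypothetical string that starts and ends with punctuation

split_tokens = re.split(r'\W+', sample)    # splits ON the non-word runs
found_tokens = re.findall(r'\w+', sample)  # collects the word runs directly

# Dropping the empty strings makes the split result match findall.
cleaned = [t for t in split_tokens if t]

print(split_tokens)
print(found_tokens)
```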







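Now consider another example.


Suppose we want to tokenize: "I'm very tired!"


We want the "'m" part to come out as its own token, which the previous approach cannot do: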








In [34]:



raw2 = "I'm very tired!"
re.findall(r'\w+',raw2)







Out[34]:

['I', 'm', 'very', 'tired']




In [35]:



re.findall(r'\w+|\S\w',raw2)







Out[35]:

['I', "'m", 'very', 'tired']
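Note that \S\w attaches exactly one word character to the preceding non-whitespace character, so longer contractions are cut short and standalone punctuation is dropped. One possible variant, which is an assumption here and not part of the original notebook, is r'\w+|\S\w*'. The sample sentence `raw3` is made up for illustration.

```python
import re

raw3 = "I'll say: tired!"  # hypothetical sentence

# Pattern from the cell above: a non-space character plus exactly one word character.
one_char = re.findall(r'\w+|\S\w', raw3)

# Variant: a non-space character plus zero or more word characters.
any_len = re.findall(r'\w+|\S\w*', raw3)

print(one_char)
print(any_len)
```

With the variant, "'ll" stays in one piece and the colon and exclamation mark become tokens of their own.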







Other commonly used regex symbols include \d (any digit) and \D (any non-digit); the table below summarizes them.














































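Symbol  Meaning (ASCII equivalent)
\b      Word boundary
\d      Any decimal digit, r'[0-9]'
\D      Any non-digit character, r'[^0-9]'
\s      Any whitespace character, r'[ \t\n\r\f\v]'
\S      Any non-whitespace character, r'[^ \t\n\r\f\v]'
\w      Any letter, digit, or underscore, r'[a-zA-Z0-9_]'
\W      Any character that is not a letter, digit, or underscore, r'[^a-zA-Z0-9_]'
\t      Tab character
\n      Newline character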
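A few of the symbols in the table have not appeared in the examples so far. The sketch below tries \b, \d, and \D on short made-up strings.

```python
import re

# \b: match "cat" only as a whole word, not inside "concat" or "cats".
whole_words = re.findall(r'\bcat\b', "cat concat 42 cats")

# \d: runs of digits.
numbers = re.findall(r'\d+', "room 101, floor 3")

# \D: split a date on runs of non-digit characters.
date_parts = re.split(r'\D+', "2024-01-15")

print(whole_words, numbers, date_parts)
```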










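Tokenization in NLTK

NLTK also provides a regex-based tokenizer:



nltk.regexp_tokenize()


This function reportedly tokenizes more efficiently and avoids the special handling that parentheses would otherwise need. Consider the example below: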








In [39]:



text = "That U.S.A poster-print costs $12.40..."
pattern = r'''(?x)            # allow verbose regexps
    (?:[A-Z]\.)+(?:[A-Z]\b)?  # abbreviations, e.g. U.S.A
  | \w+(?:-\w+)*              # words like A-B
  | \$?\d+(?:\.\d+)?%?        # e.g. $12.3, 40%
  | \.\.\.                    # ellipsis
  | [][.,:"'?()_`-]           # separate tokens
'''



In [40]:



nltk.regexp_tokenize(text,pattern)



Out[40]:

['That', 'U.S.A', 'poster-print', 'costs', '$12.40', '...']
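The remark about parentheses most likely refers to how plain re.findall treats groups: when a pattern contains capturing groups, it returns the group contents instead of the whole match. The sketch below shows this with re alone, on a shortened made-up string, and uses non-capturing groups (?:...) as one way around it.

```python
import re

text = "poster-print costs $12.40"  # hypothetical shortened sample

# Capturing groups: findall returns one tuple of group contents per match.
with_groups = re.findall(r'\w+(-\w+)*|\$?\d+(\.\d+)?', text)

# Non-capturing groups: findall returns the whole matched token.
without_groups = re.findall(r'\w+(?:-\w+)*|\$?\d+(?:\.\d+)?', text)

print(with_groups)
print(without_groups)
```

Only the second call yields usable tokens, so a pattern meant for re.findall has to use (?:...) wherever parentheses are needed purely for grouping.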











