Python Regular Expressions (Part 2)

Loading the libraries

from nltk.book import *
import nltk
import re

Introductory Examples for the NLTK Book
Loading text1, …, text9 and sent1, …, sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

Load the test text:

with open("Alcie.txt", "r") as f:
    text = f.readlines()       # read every line of the file
raw = "".join(text[:4])        # keep only the first four lines
print(raw)

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, "and what is the use of a book,"thought Alice "without pictures or conversation?'

\s and \w: splitting text

Try splitting this text:

Split with r' '

Split with r'[ \t\n]+'

Split with the regular expression r'\s+'

The results are long, so they are not shown here.

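The first regular expression is problematic: splitting on a single space ignores tabs (\t) and newlines (\n), and consecutive spaces produce empty strings.

The second regular expression matches one or more spaces, tabs, or newlines, so it splits the text properly.

The third regular expression is the most concise and gives the same result as the second on this text. \s is a built-in shorthand in the re library for any whitespace character: space, tab, and newline, plus \r, \f, and \v.

It is worth noting that \S is the complement of \s, that is, any non-whitespace character.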
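To make the comparison concrete, here is a minimal sketch of the three splits. The `sample` string is made up for illustration (it is not the Alice text) and deliberately contains a tab, a newline, and a double space.

```python
import re

# Hypothetical sample containing a tab, a newline, and a double space.
sample = "Alice was\tbeginning\nto  get very tired"

by_space = re.split(r' ', sample)          # single spaces only
by_class = re.split(r'[ \t\n]+', sample)   # runs of space, tab, or newline
by_ws = re.split(r'\s+', sample)           # runs of any whitespace

print(by_space)
print(by_class)
print(by_ws)
```

The first split leaves the tab and newline inside one token and yields an empty string for the double space; the other two return the same seven words.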

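Next, splitting with \w

\w stands for any letter, digit, or underscore.

r'\w' = r'[a-zA-Z0-9_]' (for ASCII text; in Python 3, \w on str patterns also matches Unicode word characters unless the re.ASCII flag is set)

Likewise, \W is the complement of \w: any character that is not a letter, digit, or underscore.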
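As a side note, the equivalence between \w and [a-zA-Z0-9_] only holds for ASCII input in Python 3. The sketch below uses a made-up `sample` string with accented letters to show the difference; `re.ASCII` is the standard-library flag that restricts \w to the ASCII set.

```python
import re

# Hypothetical sample with accented letters (written with escapes
# so the characters are unambiguous).
sample = "na\u00efve caf\u00e9_1"

unicode_words = re.findall(r'\w+', sample)            # default for str patterns
ascii_words = re.findall(r'\w+', sample, re.ASCII)    # \w limited to ASCII
explicit = re.findall(r'[a-zA-Z0-9_]+', sample)       # spelled-out character class

print(unicode_words)
print(ascii_words)
print(explicit)
```

With the default flags the accented words stay whole, while the ASCII variants break them at the accented characters.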

re.split(r'\W+',raw)[0:9]

['Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting']

re.findall(r'\w+',raw)[0:9]

['Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting']
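The two calls agree on the first nine tokens here, but they are not interchangeable in general. A small sketch, using a made-up `quoted` string that begins and ends with punctuation, shows where they differ:

```python
import re

# Hypothetical input that starts and ends with non-word characters.
quoted = '"Hello, world!"'

split_tokens = re.split(r'\W+', quoted)    # separators at the edges leave empty strings
found_tokens = re.findall(r'\w+', quoted)  # only the matched words are returned

print(split_tokens)
print(found_tokens)
```

re.split keeps an empty string wherever a separator sits at the start or end of the text, so re.findall is often the more convenient choice for collecting words.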

Consider another example:

Suppose we want to split: "I'm very tired!"

We want to extract the "'m" part as its own token, which the previous approach cannot do:

raw2 = "I'm very tired!"
re.findall(r'\w+',raw2)

['I', 'm', 'very', 'tired']

re.findall(r'\w+|\S\w',raw2)

['I', "'m", 'very', 'tired']
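Note that \S\w matches exactly one non-whitespace character followed by exactly one word character, so it fits "'m" but not longer contractions or a lone punctuation mark. The sketch below compares it with \S\w*, a variation that is not used in the text above; the `raw3` string is made up for illustration.

```python
import re

# Hypothetical sentence with a two-letter contraction suffix.
raw3 = "I'll be tired!"

one_char = re.findall(r"\w+|\S\w", raw3)    # suffix limited to one word character
any_len = re.findall(r"\w+|\S\w*", raw3)    # suffix of any length, including none

print(one_char)
print(any_len)
```

With \S\w the contraction is cut into "'l" and "l" and the final "!" is dropped; with \S\w* the suffix "'ll" stays together and "!" becomes its own token.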

Other commonly used regular expression symbols include \d (any digit) and \D (any non-digit):

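Symbol	Function
\b	Word boundary
\d	Any decimal digit, r'[0-9]'
\D	Any non-digit character, r'[^0-9]'
\s	Any whitespace character, r'[ \t\n\r\f\v]'
\S	Any non-whitespace character, r'[^ \t\n\r\f\v]'
\w	Any letter, digit, or underscore, r'[a-zA-Z0-9_]'
\W	Any character that is not a letter, digit, or underscore, r'[^a-zA-Z0-9_]'
\t	Tab
\n	Newline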
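A few of the symbols in the table can be tried out directly. This is a minimal sketch on two made-up strings (`rooms` and `phrase` are assumptions for illustration):

```python
import re

# Hypothetical inputs.
rooms = "Room 42, floor 3"
phrase = "the other theme, the end"

numbers = re.findall(r'\d+', rooms)        # runs of digits
between = re.split(r'\D+', rooms)          # split on runs of non-digits
loose = re.findall(r'the', phrase)         # also matches inside other words
bounded = re.findall(r'\bthe\b', phrase)   # whole word "the" only

print(numbers)
print(between)
print(len(loose), len(bounded))
```

Because the string starts with non-digits, the \D+ split begins with an empty string. The \b anchors reduce the four raw occurrences of "the" to the two that stand alone as words.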

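Tokenization in nltk

nltk also provides its own tokenization function:

nltk.regexp_tokenize()

This function is said to tokenize more efficiently and to avoid the need for special handling of parentheses. See the example below: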

text = "That U.S.A. poster-print costs $12.40..."
pattern = r'''(?x)          # allow verbose regexps
      (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*          # words with optional internal hyphens, like A-B
    | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 40%
    | \.\.\.                # ellipsis
    | [][.,;"'?():_`-]      # separate tokens
    '''

nltk.regexp_tokenize(text, pattern)

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
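The remark about parentheses can be illustrated with the standard re module alone. When a pattern contains a capturing group, re.findall returns the group contents rather than the whole match, so grouping for repetition needs the non-capturing form (?:...). A minimal sketch, with a made-up `phrase`:

```python
import re

# Hypothetical input with one hyphenated word.
phrase = "poster-print costs"

capturing = re.findall(r'\w+(-\w+)*', phrase)        # returns group 1 per match
non_capturing = re.findall(r'\w+(?:-\w+)*', phrase)  # returns the full matches

print(capturing)
print(non_capturing)
```

The capturing version reports only the last hyphenated part (or an empty string when the group did not take part in the match), while the non-capturing version returns the tokens themselves.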