« Обработка естественного языка с помощью Python » ( прочитайте мой обзор ) содержит множество мотивирующих примеров для обработки естественного языка, ориентированных на NLTK , который, помимо прочего, хорошо собирает наборы данных и алгоритмы NLP в одну библиотеку.
Давайте возьмем один из сонетов Шекспира и посмотрим, сможем ли мы порекомендовать альтернативные рифмы:
import nltk from nltk.corpus import cmudict sonnet = ['O thou, my lovely boy, who in thy power', \ 'Dost hold Time\'s fickle glass, his sickle, hour;', \ 'Who hast by waning grown, and therein showst', \ 'Thy lovers withering as thy sweet self growst;', \ 'If Nature, sovereign mistress over wrack,', \ 'As thou goest onwards, still will pluck thee back,', \ 'She keeps thee to this purpose, that her skill', \ 'May time disgrace and wretched minutes kill.', \ 'Yet fear her, O thou minion of her pleasure!', \ 'She may detain, but not still keep, her treasure:', \ 'Her audit, though delayed, answered must be,', \ 'And her quietus is to render thee.'] |
Сначала мы помечаем сонет и набираем последнее слово каждой строки.
from nltk.tokenize import word_tokenize, wordpunct_tokenize, sent_tokenize tokens = [wordpunct_tokenize(s) for s in sonnet] punct = set(['.', ',', '!', ':', ';']) filtered = [ [w for w in sentence if w not in punct ] for sentence in tokens] last = [ sentence[len(sentence) - 1] for sentence in filtered] |
Одним из корпусов, включенных в NLTK, является словарь произношения CMU, который дает нам произношение для слов, поэтому следующим шагом является поиск произношения:
from nltk.corpus import cmudict [[(w, p) for (word, pron) in cmudict.entries() if word == w] for w in last] [[('power', ['P', 'AW1', 'ER0'])], [('hour', ['AW1', 'ER0']), ('hour', ['AW1', 'R'])], [], [], [('wrack', ['R', 'AE1', 'K'])], [('back', ['B', 'AE1', 'K'])], [('skill', ['S', 'K', 'IH1', 'L'])], [('kill', ['K', 'IH1', 'L'])], [('pleasure', ['P', 'L', 'EH1', 'ZH', 'ER0'])], [('treasure', ['T', 'R', 'EH1', 'ZH', 'ER0'])], [('be', ['B', 'IY1']), ('be', ['B', 'IY0'])], [('thee', ['DH', 'IY1'])]] |
К сожалению это распадается на неузнаваемые слова. В реальной системе нам может потребоваться запасной вариант, такой как сравнение последних букв слова с другими словами в словаре. Реализация этого приведет к сокращению букв до общего набора и удалению тихих букв, например, «удалить e из конца слов», уменьшению некоторых буквенных комбинаций до общих звуков, «т.е.» -> «y», «c» -> « k »и т. д. Очевидно, что это тесно связано с английским, но поможет с выдуманными или редкими словами.
We can join the prior result to our list of words, and see pronunciations:
[[(w, p) for (word, pron) in cmudict.entries() if word == w] for w in last] [[('power', ['P', 'AW1', 'ER0'])], [('hour', ['AW1', 'ER0']), ('hour', ['AW1', 'R'])], [], [], [('wrack', ['R', 'AE1', 'K'])], [('back', ['B', 'AE1', 'K'])], [('skill', ['S', 'K', 'IH1', 'L'])], [('kill', ['K', 'IH1', 'L'])], [('pleasure', ['P', 'L', 'EH1', 'ZH', 'ER0'])], [('treasure', ['T', 'R', 'EH1', 'ZH', 'ER0'])], [('be', ['B', 'IY1']), ('be', ['B', 'IY0'])], [('thee', ['DH', 'IY1'])]] |
We can retrieve the number of sounds in a similar fashion:
syllables = \ [[(word, len(pron)) for (word, pron) in cmudict.entries() if word == w] \ for w in last] [[('power', 3)], [('hour', 2), ('hour', 2)], [], [], [('wrack', 3)], [('back', 3)], [('skill', 4)], [('kill', 3)], [('pleasure', 5)], [('treasure', 5)], [('be', 2), ('be', 2)], [('thee', 2)]] |
Combining the last two together, we can get the word, number of sounds, and phenome list in one tuple. What the following code shows is that even though Python is simpler than Perl, it’s still easy to build data structures that are impossible to read.
syllables = \ [[(w, len(p), p) for (w, p) in cmudict.entries() if word == w] \ for w in last] [[('power', 3, ['P', 'AW1', 'ER0'])], [('hour', 2, ['AW1', 'ER0']), ('hour', 2, ['AW1', 'R'])], [], [], [('wrack', 3, ['R', 'AE1', 'K'])], [('back', 3, ['B', 'AE1', 'K'])], [('skill', 4, ['S', 'K', 'IH1', 'L'])], [('kill', 3, ['K', 'IH1', 'L'])], [('pleasure', 5, ['P', 'L', 'EH1', 'ZH', 'ER0'])], [('treasure', 5, ['T', 'R', 'EH1', 'ZH', 'ER0'])], [('be', 2, ['B', 'IY1']), ('be', 2, ['B', 'IY0'])], [('thee', 2, ['DH', 'IY1'])]] |
Now that we have this, we need to write a test to see whether two words rhyme. To do this, we remove the first sound, and compare the ends of two words:
def rhymes(s): try: (w, l, p) = s[0] try: return [wt for (wt, pt) in cmudict.entries() \ if l == len(pt) and p[1:] == pt[1:] ] except: return [w] except: return [] [rhymes(s) for s in syllables] |
This returns a lot of results – much of it is good, but a lot is also garbage.
[['bauer', 'baur', 'bougher', 'bower', 'cower', 'dauer', 'dour', 'gauer', 'gower', 'hauer', 'hower', "how're", 'kauer', 'knauer', 'lauer', 'mauer', 'nauer', 'power', 'rauer', 'sauer', 'schauer', 'shower', 'sour', 'tauer', 'tower'], ['auer', 'ayer', 'ayer', 'eyer', 'for', 'her', 'hour', 'iyer', 'our', 'oyer', 'uher', 'were'], [], [], ['back', 'backe', 'bak', 'bakke', 'cac', 'caq', 'dac', 'dack', 'dak', 'fac', 'hack', 'hacke', 'haq', 'haque', 'jac', 'jack', 'jacques', 'knack', 'lac', 'lack', 'lak', 'mac', 'mack', 'macke', 'mak', 'nack', 'nacke', 'pac', 'pack', 'pak', 'paque', 'ptak', 'rack', 'rak', 'sac', 'sack', 'sak', 'schack', 'shack', 'shaq', 'tac', 't_a_c', 'tack', 'tacke', 'tak', 'wack', 'whack', 'wrack', 'yack', 'yak', 'zach', 'zack', 'zak'], ['back', 'backe', 'bak', 'bakke', 'cac', 'caq', 'dac', 'dack', 'dak', 'fac', 'hack', 'hacke', 'haq', 'haque', 'jac', 'jack', 'jacques', 'knack', 'lac', 'lack', 'lak', 'mac', 'mack', 'macke', 'mak', 'nack', 'nacke', 'pac', 'pack', 'pak', 'paque', 'ptak', 'rack', ... truncated, since this goes on for a while... |
One issue with the above list is a lot of these “words” are names or garbage tokens, but not really used in a literary sense. We can filter these down by joining to Wordnet, which is a corpus with some semantic information, and more focused on real words.
def rhymes(s): try: (w, l, p) = s[0] try: filtered = [wt for (wt, pt) in cmudict.entries() \ if l == len(pt) \ and p[1:] == pt[1:] \ and len(wordnet.synsets(wt)) > 0] return filtered except: return [w] except: return [] |
This is better – we get a lot less, although not all of these words could be used in context, and most of them are pretty obvious (if you look for a rhyming word, you can always run through the alphabet in your head, which doesn’t require software)
[['bower', 'cower', 'dour', 'power', 'shower', 'sour', 'tower'], ['hour', 'were'], [], [], ['back', 'dak', 'hack', 'jack', 'knack', 'lac', 'lack', 'mac', 'mack', 'mak', 'pac', 'pack', 'rack', 'sac', 'sack', 'shack', 'tack', 'whack', 'wrack', 'yack', 'yak'], ['back', 'dak', 'hack', 'jack', 'knack', 'lac', 'lack', 'mac', 'mack', 'mak', 'pac', 'pack', 'rack', 'sac', 'sack', 'shack', 'tack', 'whack', 'wrack', 'yack', 'yak'], ['skill'], ['bill', 'chill', 'dill', 'fill', 'gill', 'hill', 'kill', 'lille', 'mil', 'mill', 'nil', 'pill', 'rill', 'shill', 'sill', 'thill', 'till', 'will', 'zill'], ['pleasure'], ['treasure'], ['b', 'be', 'bee', 'c', 'd', 'de', 'dea', 'fee', 'g', 'gee', 'ghee', 'he', 'ji', 'kea', 'key', 'ki', 'knee', 'lea', 'lee', 'leigh', 'li', 'me', 'mi', 'ne', 'nee', 'ni', 'p', 'pea', 'pee', 'qi', 'quay', 're', 'sea', 'see', 'si', 't', 'te', 'tea', 'tee', 'ti', 'v', 'vi', 'wee', 'xi', 'yi', 'z', 'zea', 'zee'], ['b', 'be', 'bee', 'c', 'd', 'de', 'dea', 'fee', 'g', 'gee', 'ghee', 'he', 'ji', 'kea', 'key', 'ki', 'knee', 'lea', 'lee', 'leigh', 'li', 'me', 'mi', 'ne', 'nee', 'ni', 'p', 'pea', 'pee', 'qi', 'quay', 're', 'sea', 'see', 'si', 't', 'te', 'tea', 'tee', 'ti', 'v', 'vi', 'wee', 'xi', 'yi', 'z', 'zea', 'zee']] |
The next version is a minor improvement, checking the ends, rather than the whole words. Note that for very short words, it removes all the random garbage (all the single sound words were treated as matching, since we compared 0 length lists)
def rhymes(s): try: (w, l, p) = s[0] try: filtered = [wt for (wt, pt) in cmudict.entries() if l == len(pt) and p[-2:] == pt[-2:] and len(nltk.corpus.wordnet.synsets(wt)) > 0] return filtered except: return [w] except: return [] [['bower', 'cower', 'dour', 'power', 'shower', 'sour', 'tower'], ['hour'], [], [], ['back', 'dak', 'hack', 'jack', 'knack', 'lac', 'lack', 'mac', 'mack', 'mak', 'pac', 'pack', 'rack', 'sac', 'sack', 'shack', 'tack', 'whack', 'wrack', 'yack', 'yak'], ['back', 'dak', 'hack', 'jack', 'knack', 'lac', 'lack', 'mac', 'mack', 'mak', 'pac', 'pack', 'rack', 'sac', 'sack', 'shack', 'tack', 'whack', 'wrack', 'yack', 'yak'], ['brill', 'drill', 'frill', 'grill', 'grille', 'krill', 'quill', 'shrill', 'skill', 'spill', 'still', 'swill', 'thrill', 'trill', 'twill'], ['bill', 'chill', 'dill', 'fill', 'gill', 'hill', 'kill', 'lille', 'mil', 'mill', 'nil', 'pill', 'rill', 'shill', 'sill', 'thill', 'till', 'will', 'zill'], ['closure', 'crozier', 'pleasure', 'treasure'], ['closure', 'crozier', 'pleasure', 'treasure'], ['b', 'be', 'bee'], []] |
This returns a few words that are very similar to the original. To reduce this, we can check the “edit distance”, the number of letters you would have to change to turn one letter into another. Unfortunately on it’s own this takes away some good rhymes, so we also to check if the first letters differ too. This allos “tower” and “power” to match, but not “old” and “olde” (the reason you will still see some examples like this is words are compared to the original – not each other).
def rhymes(s): try: (w, l, p) = s[0] try: filtered = [wt for (wt, pt) in cmudict.entries() if l == len(pt) and p[-2:] == pt[-2:] and (nltk.distance.edit_distance(w, wt) > 2 \ or not w[0:2] == wt[0:2]) and len(nltk.corpus.wordnet.synsets(wt)) > 0] return filtered except: return [w] except: return [] [['bower', 'cower', 'dour', 'shower', 'sour', 'tower'], [], [], [], ['back', 'dak', 'hack', 'jack', 'knack', 'lac', 'lack', 'mac', 'mack', 'mak', 'pac', 'pack', 'rack', 'sac', 'sack', 'shack', 'tack', 'whack', 'yack', 'yak'], ['dak', 'hack', 'jack', 'knack', 'lac', 'lack', 'mac', 'mack', 'mak', 'pac', 'pack', 'rack', 'sac', 'sack', 'shack', 'tack', 'whack', 'wrack', 'yack', 'yak'], ['brill', 'drill', 'frill', 'grill', 'grille', 'krill', 'quill', 'shrill', 'spill', 'still', 'swill', 'thrill', 'trill', 'twill'], ['bill', 'chill', 'dill', 'fill', 'gill', 'hill', 'lille', 'mil', 'mill', 'nil', 'pill', 'rill', 'shill', 'sill', 'thill', 'till', 'will', 'zill'], ['closure', 'crozier', 'treasure'], ['closure', 'crozier', 'pleasure'], ['b'], []] |
From here, we can remove some words that don’t make sense in context by checking what parts of speech Wordnet has seen them in. We get the set of all possible part of speech tags for a word, and compare that to a potentially rhyming word – if there is overlap, we keep it.
def cmp(p, wt): pt = set([s.pos for s in wordnet.synsets(wt)]) return len(pt & p) > 0 or len(p) == 0 or len(pt) == 0 def rhymes(s): try: (w, l, p) = s[0] try: pos = set([s.pos for s in wordnet.synsets(w)]) filtered = [wt for (wt, pt) in cmudict.entries() if l == len(pt) and p[-2:] == pt[-2:] and (nltk.distance.edit_distance(w, wt) > 2 \ or not w[0:2] == wt[0:2]) and cmp(pos, wt) and len(nltk.corpus.wordnet.synsets(wt)) > 0] return filtered except: return [w] except: return [] |
Note this removes “dour” in the first part, where we expect to see a noun of some sort. The list isn’t too repetitive now, although it is getting quite short.
[['bower', 'cower', 'shower', 'sour', 'tower'], [], [], [], ['back', 'dak', 'hack', 'jack', 'knack', 'lac', 'lack', 'mac', 'mack', 'mak', 'pac', 'pack', 'rack', 'sac', 'sack', 'shack', 'tack', 'whack', 'yack', 'yak'], ['dak', 'hack', 'jack', 'knack', 'lac', 'lack', 'mac', 'mack', 'mak', 'pac', 'pack', 'rack', 'sac', 'sack', 'shack', 'tack', 'whack', 'wrack', 'yack', 'yak'], ['brill', 'drill', 'frill', 'grill', 'grille', 'krill', 'quill', 'spill', 'still', 'swill', 'thrill', 'trill', 'twill'], ['bill', 'chill', 'dill', 'fill', 'gill', 'hill', 'lille', 'mil', 'mill', 'nil', 'pill', 'rill', 'shill', 'sill', 'thill', 'till', 'will', 'zill'], ['closure', 'crozier', 'treasure'], ['closure', 'crozier', 'pleasure'], ['b'], []] |
One minor issue here is that “sour” is in the first set. Let’s try one last attempt – instead of checking all possible roles a word could play, lets check the role it actually plays, by running the text through a part of speech tagger. Finally can justify having the entire text the whole text of each line, even though we’ve been ignoring it thus far.
The NLKT part of speech tagger is much more detailed than what Wordnet has – this is because Wordnet acts as a dictionary of nouns, verbs, adjectives, and adverbs, but is less concerned about their roles, so we need to establish a mapping.
tagged = [nltk.pos_tag(t, simplify_tags = True) for t in tokens] [[('O', 'NNP'), ('thou', 'VBD'), (',', ','), ('my', 'PRP |
Interestingly, this actually returns more than the prior technique. This is because NLTK can tag many more words than Wordnet, since Wordnet is aiming at specific semantic interests. This maintains properties similar to the previous list (“dour” is still out), but less has been removed. Overall the word diversity is quite high as well. Clearly, one could spend a lot of time even on this simple task.
[rhymes(r) for r in syllables] [['bower', 'shower', 'sour', 'tower'], ['aerial', 'amble', 'angel', 'angle', 'ankle', 'anvil', 'april', 'arousal', 'arrival', 'axle', 'babble', 'babel', 'baffle', 'bagel', 'barrel', 'barrel', 'basel', 'basil', 'basle', 'battle', 'bauble', 'beadle', 'beagle', 'beetle', 'beigel', 'beryl', 'betel', 'bethel', 'bevel', 'bible', 'bobble', 'boodle', 'bottle', 'bubble', 'buckle', 'bushel', 'bustle', 'cable', 'cackle', 'camel', 'carol', 'carol', 'carrel', 'carroll', 'carroll', 'castle', 'cattle', 'channel', 'chapel', 'chattel', 'chisel', 'choral', 'chuckle', 'chunnel', 'circle', 'cobble', 'cockle', 'colonel', 'coral', 'couple', 'cuddle', 'cudgel', 'cycle', 'cyril', 'dazzle', 'dental', 'devil', 'dibble', 'diesel', 'diesel', 'doodle', 'double', 'duffel', 'durrell', 'equal', 'fable', 'facial', 'faisal', 'fennel', 'fiddle', 'final', 'fipple', 'fizzle', 'foible', 'fossil', 'fuel', 'funnel', 'gable', 'gaggle', 'gavel', 'geisel', 'gesell', 'giggle', 'girdle', 'gobble', 'goral', 'gurgle', 'hackle', 'haggle', 'hassel', 'hassle', 'hatchel', 'havel', 'hazel', 'heckle', 'hegel', 'herbal', 'herschel', 'hobble', 'hovel', 'hubble', 'huddle', 'hurdle', 'hustle', 'jackal', 'jiggle', 'jostle', 'journal', 'juggle', 'kennel', 'kernel', 'kettle', 'kibble', 'knuckle', 'label', 'ladle', 'laurel', 'level', 'libel', 'little', 'local', 'lovell', 'mammal', 'maple', 'medal', 'memel', 'metal', 'methyl', 'mettle', 'michael', 'mickle', 'middle', 'missal', 'missile', 'mitchell', 'mobile', 'modal', 'model', 'mogul', 'moral', 'mosul', 'motile', 'muckle', 'muddle', 'muffle', 'muscle', 'mussel', 'muzzle', 'myrtle', 'nasal', 'natal', 'navel', 'needle', 'nestle', 'nettle', 'nibble', 'nickel', 'nipple', 'noble', 'noodle', 'novel', 'nozzle', 'paddle', 'panel', 'pebble', 'pedal', 'people', 'peril', 'petal', 'phenol', 'pickle', 'piddle', 'pommel', 'poodle', 'pottle', 'puddle', 'purple', 'puzzle', 'rabble', 'rachel', 'raffle', 'rattle', 'ravel', 'razzle', 'rebel', 'revel', 'riddle', 'riffle', 'rifle', 'rigel', 'ripple', 'rival', 'roble', 'rommel', 'rouble', 'rubble', 'rubel', 'ruble', 'ruddle', 'ruffle', 'russell', 'rustle', 'sable', 'saddle', 'seckel', 'segal', 'seidel', 'settle', 'shackle', 'shekel', 'shovel', 'shuffle', 'shuttle', 'sibyl', 'social', 'sorrel', 'table', 'tackle', 'tassel', 'tattle', 'tercel', 'thermal', 'thistle', 'tickle', 'tipple', 'title', 'tittle', 'toggle', 'tootle', 'total', 'trial', 'tunnel', 'turtle', 'tussle', 'umbel', 'uncle', 'vassal', 'vergil', 'vessel', 'vigil', 'vinyl', 'virgil', 'vocal', 'waddle', 'waffle', 'wattle', 'weasel', 'weevil', 'whistle', 'whittle', 'wiesel', 'wiggle', 'wobble', 'wrestle', 'wriggle', 'yodel'], [], [], ['back', 'dak', 'hack', 'jack', 'knack', 'lac', 'lack', 'mac', 'mack', 'mak', 'pac', 'pack', 'rack', 'sac', 'sack', 'shack', 'tack', 'whack', 'yack', 'yak'], [], ['brill', 'drill', 'frill', 'grill', 'grille', 'krill', 'quill', 'spill', 'still', 'swill', 'thrill', 'trill', 'twill'], ['bill', 'chill', 'fill', 'hill', 'mill', 'shill', 'till', 'will'], ['closure', 'crozier', 'treasure'], ['closure', 'crozier', 'pleasure'], [], []] |