Python (nltk) - More Efficient Way To Extract Noun Phrases?
I've got a machine learning task involving a large amount of text data. I want to identify and extract noun phrases from the training text so I can use them for feature construction.
Solution 1:
Take a look at Why is my NLTK function slow when processing the DataFrame? There's no need to iterate through all rows multiple times if you don't need the intermediate steps.
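As a minimal sketch of that point (the column names here are purely illustrative, not from the question):
[code]:
from nltk import word_tokenize, pos_tag, ne_chunk
import pandas as pd

df = pd.DataFrame({'text': ['This is a foo, bar sentence with New York city.']})

# Slower: three separate passes over the rows, each materialising a column.
df['tokens'] = df['text'].apply(word_tokenize)
df['pos'] = df['tokens'].apply(pos_tag)
df['ner'] = df['pos'].apply(ne_chunk)

# Faster: a single pass that composes the steps, keeping no intermediates.
df['ner'] = df['text'].apply(lambda t: ne_chunk(pos_tag(word_tokenize(t))))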
With ne_chunk, [code]:
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd

def get_continuous_chunks(text, chunk_func=ne_chunk):
    # Tokenize, POS-tag, then chunk in one pass.
    chunked = chunk_func(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []
    for subtree in chunked:
        if type(subtree) == Tree:
            # Collect the tokens of a chunked subtree (e.g. a named entity).
            current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
        elif current_chunk:
            # A non-chunk token ends the current chunk; flush it.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []
        else:
            continue
    return continuous_chunk

df = pd.DataFrame({'text': ['This is a foo, bar sentence with New York city.',
                            'Another bar foo Washington DC thingy with Bruce Wayne.']})
df['text'].apply(lambda sent: get_continuous_chunks(sent))
[out]:
0                    [New York]
1    [Washington, Bruce Wayne]
Name: text, dtype: object
Note that ne_chunk only finds named entities. To extract noun phrases more generally, define a custom RegexpParser grammar and pass its .parse method as the chunk_func:
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd

# Defining a grammar & parser: one or more verb- or noun-tagged tokens,
# then anything, ending in a noun-tagged token.
NP = r"NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunker = RegexpParser(NP)

def get_continuous_chunks(text, chunk_func=ne_chunk):
    chunked = chunk_func(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []
    for subtree in chunked:
        if type(subtree) == Tree:
            current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []
        else:
            continue
    return continuous_chunk

df = pd.DataFrame({'text': ['This is a foo, bar sentence with New York city.',
                            'Another bar foo Washington DC thingy with Bruce Wayne.']})
df['text'].apply(lambda sent: get_continuous_chunks(sent, chunker.parse))
[out]:
0                 [bar sentence, New York city]
1    [bar foo Washington DC thingy, Bruce Wayne]
Name: text, dtype: object
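If you want to see what the chunker builds before get_continuous_chunks flattens it, here is a minimal sketch (it reuses the chunker defined above):
[code]:
from nltk import word_tokenize, pos_tag

# Parse one sentence into a tree; reuses `chunker` from the block above.
sent = pos_tag(word_tokenize('Another bar foo Washington DC thingy with Bruce Wayne.'))
tree = chunker.parse(sent)

# The subtrees labelled "NP" are exactly what get_continuous_chunks joins into strings.
for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
    print(' '.join(token for token, pos in subtree.leaves()))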
Solution 2:
I suggest referring to this prior thread: Extracting all Nouns from a text file using nltk. It suggests TextBlob as the easiest way to do this (if not the most efficient in terms of processing), and the discussion there addresses your question.
from textblob import TextBlob
txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages."""
blob = TextBlob(txt)
print(blob.noun_phrases)
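Note that noun_phrases relies on corpora that TextBlob downloads separately (python -m textblob.download_corpora). To apply it row-wise like the solutions above, a minimal sketch (reusing the sample DataFrame from Solution 1, which is an assumption here):
[code]:
import pandas as pd
from textblob import TextBlob

# Assumes the corpora were fetched once via:  python -m textblob.download_corpora
df = pd.DataFrame({'text': ['This is a foo, bar sentence with New York city.',
                            'Another bar foo Washington DC thingy with Bruce Wayne.']})

# noun_phrases returns a WordList of lowercased phrases; wrap in list() for a plain column.
df['noun_phrases'] = df['text'].apply(lambda sent: list(TextBlob(sent).noun_phrases))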
Solution 3:
The above methods didn't give me the required results, so here is the function I would suggest:
from nltk import word_tokenize, pos_tag
import re

def get_noun_phrases(text):
    pos = pos_tag(word_tokenize(text))
    count = 0
    half_chunk = ""
    for word, tag in pos:
        if re.match(r"NN.*", tag):
            # Consecutive noun-tagged words accumulate into one phrase.
            count += 1
            if count >= 1:
                half_chunk = half_chunk + word + " "
        else:
            # Any non-noun word marks a phrase boundary.
            half_chunk = half_chunk + "---"
            count = 0
    # Collapse runs of dashes into a single separator, then split on it.
    half_chunk = re.sub(r"-+", "?", half_chunk).split("?")
    half_chunk = [x.strip() for x in half_chunk if x != ""]
    return half_chunk
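A usage example on one of the sample sentences from above; the exact phrases depend on the POS tagger's output, so the result shown is only indicative:
[code]:
print(get_noun_phrases('This is a foo, bar sentence with New York city.'))
# With NLTK's default tagger, roughly: ['foo', 'bar sentence', 'New York city']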