Using machine language models to help students identify proper nouns: Names, named entities, and the Wheel of Fortune Corpus
What is practical NLP?
Machine learning and natural language processing have made incredible strides recently, but sometimes they seem a bit... detached... from the real-world problems faced by low-budget ESL teachers and language learners. So, I decided to do a series on practical NLP with a focus on language learning. In this first post I want to:
- introduce Named Entity Recognition (NER)
- provide a motivation for using NER to help language learners
- show how easy it is to use Spacy NER in practice.
In future posts I will talk about more technical details such as deploying NER as an API, creating a user interface and fine tuning the language model for better performance.
And what is NER?
NER stands for named entity recognition. It simply means identifying the entities in a text, or what language people might think of as finding proper nouns. You can identify many categories of entities, such as people’s names, place names, movie titles, organizations, etc. NER is already a well-developed area of natural language processing, and there are many excellent language models freely available that can perform NER effectively off the shelf.
So, why do we need NER?
Let’s motivate the use case of NER with a little linguistic background. Question: how many words do you need to know in order to be able to speak a language? It’s actually a surprisingly difficult and divisive question. In linguistics circles there’s a lot of sharp disagreement about the details of what it actually means to “know” a word and what a “word” is (or a “word family”, or a “lexeme”, or a “lemma”...). To give us a rough framework for discussing the topic, here are some informal benchmarks that English teachers use when the linguists aren’t looking:
- 1000: the core vocabulary items in English
- 2000: the number of vocabulary items you need to take care of most daily business
- 8,000–10,000: the number of vocabulary items you need to read the majority of texts or go to university
- 20,000–40,000: the number of vocabulary items native speakers know
Paul Nation is a great source for studying this topic more formally. However, he and other language professionals calculate word lists such as the above “excluding proper names and transparently derived forms.” The problem for language learners is that there are a lot of proper names and they appear in all sorts of texts.
Confusion with proper nouns
ESL students often see common names like “John,” “Mary” and “Paris”, but how about “Reginald,” or “Los Alamos,” or “Men in Black”? Even students with large vocabularies get stuck on proper nouns. If you say:
“Let’s go see Men in Black”, they are thinking:
Who are these “Men” that you speak of? And why are they “in Black”?
Easy, you say? Try it in Chinese.
To really understand how unintuitive names may be for non native speakers, consider this simple Chinese sentence:
Any student of Mandarin can tell you that “有” means “has”. And “人” and “人民” can mean “people.” But what is “馬修沃斯”? And what is “幣”?
It turns out that “馬修沃斯” is my name and “人民幣” is the Chinese currency. Without knowing this though, students can use NER to understand the sentence meaning: “Person has 10 currency.”
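The idea of reducing a sentence to “Person has 10 currency” can be sketched in a few lines of Python. This is only an illustration, not the code from this post: the function name replace_entities_with_labels is made up, and the entity span below is written by hand rather than produced by a model. With Spacy, the spans would come from each entity’s start_char, end_char and label_ attributes.

```python
from typing import List, Tuple

# (start, end, label) character offsets. In practice these would come from an
# NER model; here they are supplied by hand for illustration.
Span = Tuple[int, int, str]


def replace_entities_with_labels(text: str, spans: List[Span]) -> str:
    """Return the text with every entity span replaced by its label in brackets."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # text before the entity
        pieces.append(f"[{label}]")        # the entity, reduced to its type
        cursor = end
    pieces.append(text[cursor:])           # whatever follows the last entity
    return "".join(pieces)


sentence = "Let's go see Men in Black"
title = "Men in Black"
start = sentence.index(title)

# Hand-written span (an assumption, not real model output).
hand_labeled = [(start, start + len(title), "WORK_OF_ART")]

simplified = replace_entities_with_labels(sentence, hand_labeled)
print(simplified)
```

Even without knowing the movie, a learner who sees the label can tell that “Men in Black” is a title rather than a group of men dressed in black.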
The Wheel of Fortune Corpus
The linguist Ray Jackendoff has a nice framing of this phenomenon: he calls it “The Wheel of Fortune Corpus” and he uses it to indicate that native speakers actually know a lot more vocabulary items than we give them credit for. Think about it: Wheel of Fortune has been on TV forever, but they never run out of phrases to use. Somehow, if you show a native speaker a phrase like
_ _ E A_ _A S T
_ _ A _ _ IONS
they can yell out the solution and win a dining room set or cash. Non-native speakers meanwhile have no chance of answering, or even of understanding the answer after they have seen it. See Jackendoff’s The Architecture of the Language Faculty for more fun insights about language and vocabulary.
Named entity recognition to the rescue
So how can we use machine learning to help these struggling students? Free language models like Spacy’s can get good results on named entity recognition straight off the shelf (although additional training helps the models’ performance). I chose the Spacy models because they are small, have a lot of different entity types and include a nice display function. Spacy also offers a clear website with lots of tutorials, and there are lots of great articles diving into the details of Spacy. They currently offer 18 languages with trained pipelines, so you literally don’t have to do anything more than import them from the models page.
A word about models and model size
There are a lot of different models, and if you want state-of-the-art performance you will need to make space for a larger language model. For English I like to use Spacy’s “en_core_web_trf”. The name breaks down as follows: “en” means the model is English, “core” means it includes vocabulary, syntax, entities and vectors, and “web” means it was trained on written text from the internet. “trf” means it is a transformer model, here roberta-base, and it works great, but it’s big (438 MB). The smallest English model is only 13 MB and works well, but not perfectly. Similarly, “zh_core_web_trf” is a 398 MB bert-base-chinese model that works very well. On the other hand, the small Chinese model “zh_core_web_sm” provides rather poor results. In my next post I will talk about practical tips for deploying and fine tuning these models.
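Because the model names follow a regular pattern, switching between a small and a large model can be a one-word change. Here is a minimal sketch of that idea. The helpers pipeline_name and load_pipeline are hypothetical names of my own, not part of Spacy, and load_pipeline only works once Spacy and the chosen model are installed.

```python
def pipeline_name(lang: str, size: str, genre: str = "web") -> str:
    """Build a Spacy-style pipeline name such as 'en_core_web_trf'."""
    return f"{lang}_core_{genre}_{size}"


def load_pipeline(lang: str, size: str):
    """Load a trained pipeline by language and size.

    Requires Spacy and the model package to be installed. Spacy is imported
    inside the function so the naming helper above works without it.
    """
    import spacy

    return spacy.load(pipeline_name(lang, size))


# The four models mentioned in this post.
for lang in ("en", "zh"):
    for size in ("sm", "trf"):
        print(pipeline_name(lang, size))
```

Calling load_pipeline("en", "sm") while prototyping and load_pipeline("en", "trf") when accuracy matters keeps the rest of the code unchanged.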
Code Break: Start with the imports
Let’s look at the core code to see how Spacy works (for the full code see my github repo). In lines 1–7, we import the dependencies, starting with spacy. displacy is the display function I mentioned earlier that tags the entities in different colors. Finally, nlp = spacy.load(language_model_name) lets you load any available language model. Here, I load the larger English and Chinese models.
Basic code for identifying entities with NER
The Main function
Using Spacy you can create a dictionary of the entity types you want to identify; in lines 10–12 I choose 12 entity types to label, because they seem most useful for language learners. Spacy offers 18 entity types, or you can define your own entities. raw_text is the input text, in this case taken from Google News. The magic happens in line 14 when we call “nlp(raw_text)” to analyze the text. The Spacy website explains:
“When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the trained pipelines typically include a tagger, a lemmatizer, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.”
Finally, line 15, “displacy.render,” creates a colorful chart with the named entities from the text. It does look like the final user interface will need to add some explanations of the labels though: EVENT is pretty clear, but most users won’t know that GPE is “Countries, cities, states.”
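For readers who want something to run right away, here is a minimal sketch of the steps described in this section. It is not the code from my repo, so its line numbers will not match the ones mentioned in the text. The six labels in WANTED_LABELS are an illustrative subset rather than the 12 I actually use, and the function names wanted_entities and render_entities are made up for this sketch. spacy.load, spacy.explain and displacy.render are real Spacy calls, and they need Spacy and the model to be installed.

```python
# Illustrative subset of Spacy's entity labels (an assumption, not the
# author's actual list of 12).
WANTED_LABELS = ["PERSON", "ORG", "GPE", "LOC", "EVENT", "WORK_OF_ART"]


def wanted_entities(doc, labels=WANTED_LABELS):
    """Return (text, label) pairs for entities whose label is in `labels`."""
    return [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in labels]


def render_entities(raw_text, model_name="en_core_web_trf", labels=WANTED_LABELS):
    """Analyze raw_text and return displaCy's HTML markup for the chosen labels.

    Spacy is imported inside the function so that wanted_entities can be
    used and tested without it.
    """
    import spacy
    from spacy import displacy

    nlp = spacy.load(model_name)   # load the trained pipeline
    doc = nlp(raw_text)            # tokenizer, tagger, parser, NER, ...

    # Print each entity with a human-readable explanation of its label.
    for text, label in wanted_entities(doc, labels):
        print(f"{text}: {label} ({spacy.explain(label)})")

    # The options dictionary restricts the highlighting to the chosen labels.
    return displacy.render(doc, style="ent", options={"ents": labels}, jupyter=False)
```

Calling render_entities("Let's go see Men in Black in Paris") returns an HTML string that you can save to a file and open in a browser. The spacy.explain call is one possible way to give users a plain-language description of labels like GPE.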
Results from displacy.render
So, that’s it for part 1 of this series on turning current machine learning models into free, easy to use language tools. I hope that I’ve at least convinced you of the potential of using NER for language acquisition. Future posts will cover deployment and fine tuning of the NER model, along with introducing other NLP applications that can easily be applied to helping language learners. Meanwhile, I’d be glad to hear any suggestions or comments. You can leave them below or connect with me on LinkedIn.
On a final note, if you are running this code on your local machine, you will need to install spacy and the language model (see more here):
- pip install -U pip setuptools wheel
- pip install -U spacy
- python -m spacy download en_core_web_trf
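If you are not sure whether the install worked, a quick check like the one below can help. It is only a sketch and relies on the fact that downloaded Spacy models are installed as regular Python packages; the helper name is_installed is my own.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level Python package with this name can be found."""
    return importlib.util.find_spec(package_name) is not None


# Check both Spacy itself and the model used in this post.
for name in ("spacy", "en_core_web_trf"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```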