3 open source NLP tools for data extraction
Developers and data scientists use generative AI and large language models (LLMs) to query volumes of documents and unstructured data. Open source LLMs, including Dolly 2.0, EleutherAI Pythia, Meta AI LLaMa, StabilityLM, and others, are all starting points for experimenting with artificial intelligence that accepts natural language prompts and generates summarized responses.
“Text as a source of knowledge and information is fundamental, yet there aren’t any end-to-end solutions that tame the complexity in handling text,” says Brian Platz, CEO and co-founder of Fluree. “While most organizations have wrangled structured or semi-structured data into a centralized data platform, unstructured data remains forgotten and underleveraged.”
If your organization and team aren’t experimenting with natural language processing (NLP) capabilities, you’re probably lagging behind competitors in your industry. In the 2023 Expert NLP Survey Report, 77% of organizations said they planned to increase spending on NLP, and 54% said their time-to-production was a top return-on-investment (ROI) metric for successful NLP projects.
Use cases for NLP
If you have a corpus of unstructured data and text, some of the most common business needs include:
- Entity extraction by identifying names, dates, places, and products
- Pattern recognition to discover currency and other quantities
- Categorization into business terms, topics, and taxonomies
- Sentiment analysis, including positivity, negation, and sarcasm
- Summarizing the document’s key points
- Machine translation into other languages
- Dependency graphs that translate text into machine-readable semi-structured representations
Sometimes, having NLP capabilities bundled into a platform or application is desirable. For example, LLMs support asking questions; AI search engines enable searches and recommendations; and chatbots support interactions. Other times, it’s optimal to use NLP tools to extract information and enrich unstructured documents and text.
Let’s look at three popular open source NLP tools that developers and data scientists are using to perform discovery on unstructured documents and develop production-ready NLP processing engines.
Natural Language Toolkit
The Natural Language Toolkit (NLTK), released in 2001, is one of the older and more popular NLP Python libraries. NLTK boasts more than 11,800 stars on GitHub and lists over 100 trained models.
“I think the most important tool for NLP is by far Natural Language Toolkit, which is licensed under Apache 2.0,” says Steven Devoe, director of data and analytics at SPR. “In all data science projects, the processing and cleaning of the data to be used by algorithms is a huge proportion of the time and effort, which is particularly true with natural language processing. NLTK accelerates a lot of that work, such as stemming, lemmatization, tagging, removing stop words, and embedding word vectors across multiple written languages to make the text more easily interpreted by the algorithms.”
NLTK’s benefits stem from its endurance, with many examples available for developers new to NLP, from beginner’s hands-on guides to more comprehensive overviews. Anyone learning NLP techniques may want to try this library first, as it provides simple ways to experiment with basic techniques such as tokenization, stemming, and chunking.
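To illustrate those basics, here is a minimal sketch of a typical NLTK preprocessing pass covering tokenization, stop-word removal, and stemming. The sample sentence is invented for the example:

```python
# A minimal NLTK preprocessing sketch: tokenize, drop stop words, then stem.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Download the required resources on first run
nltk.download("punkt")
nltk.download("stopwords")

text = "The contractors submitted revised blueprints for the office building in March."

tokens = word_tokenize(text)                  # split the text into word tokens
stop_words = set(stopwords.words("english"))  # standard English stop-word list
filtered = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]

stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in filtered]   # reduce each word to its stem

print(filtered)
print(stems)
```

From here, the same pipeline can be extended with NLTK's taggers and chunkers as the project's requirements grow.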
spaCy
spaCy is a newer library, with its version 1.0 released in 2016. spaCy supports over 72 languages and publishes its performance benchmarks, and it has amassed more than 25,000 stars on GitHub.
“spaCy is a free, open-source Python library providing advanced capabilities to conduct natural language processing on large volumes of text at high speed,” says Nikolay Manchev, head of data science, EMEA, at Domino Data Lab. “With spaCy, a user can build models and production applications that underpin document analysis, chatbot capabilities, and all other forms of text analysis. Today, the spaCy framework is one of Python’s most popular natural language libraries for industry use cases such as extracting keywords, entities, and knowledge from text.”
Tutorials for spaCy show similar capabilities to NLTK, including named entity recognition and part-of-speech (POS) tagging. One advantage is that spaCy returns document objects and supports word vectors, which can give developers more flexibility for performing additional post-NLP data processing and text analytics.
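As a rough sketch of how that looks in practice, the snippet below runs POS tagging and named entity recognition over a spaCy Doc object. It assumes the small English model, en_core_web_sm, has already been installed (python -m spacy download en_core_web_sm), and the sample sentence is illustrative:

```python
# A minimal spaCy sketch: POS tagging and named entity recognition.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Acme Corp is opening a new office in Austin in September for $25 million.")

# Part-of-speech tag and dependency label for each token
for token in doc:
    print(token.text, token.pos_, token.dep_)

# Named entities with their labels (ORG, GPE, DATE, MONEY, and so on)
for ent in doc.ents:
    print(ent.text, ent.label_)
```

Because nlp() returns a full Doc object, downstream code can also reach token attributes and word vectors for further text analytics.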
Spark NLP
If you already use Apache Spark and have its infrastructure configured, then Spark NLP may be one of the faster paths to begin experimenting with natural language processing. Spark NLP has several installation options, including AWS, Azure Databricks, and Docker.
“Spark NLP is a widely used open-source natural language processing library that enables businesses to extract information and answers from free-text documents with state-of-the-art accuracy,” says David Talby, CTO of John Snow Labs. “This enables everything from extracting relevant health information that only exists in clinical notes, to identifying hate speech or fake news on social media, to summarizing legal agreements and financial news.”
Spark NLP’s differentiators may be its healthcare, finance, and legal domain language models. These commercial products include pre-trained models that identify drug names and dosages in healthcare documents, recognize financial entities such as stock tickers, and build legal knowledge graphs of company names and officers.
Talby says Spark NLP can help organizations minimize the upfront training in developing models. “The free and open source library comes with more than 11,000 pre-trained models plus the ability to reuse, train, tune, and scale them easily,” he says.
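For a sense of what getting started looks like, here is a hedged sketch that loads one of the library's published pre-trained English pipelines, explain_document_dl, and annotates a sample sentence; the input text and the result fields printed are illustrative:

```python
# A minimal Spark NLP sketch using a published pre-trained pipeline.
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()  # starts a local Spark session with Spark NLP loaded

pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = pipeline.annotate("John Snow Labs released a new clinical model in 2023.")

print(result["entities"])  # named entities recognized in the text
print(result["pos"])       # part-of-speech tags
```

Swapping in one of the domain-specific models follows the same pattern, which is where the pre-trained catalog saves training time.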
Best practices for experimenting with NLP
Earlier in my career, I had the opportunity to oversee the development of several SaaS products built on NLP capabilities. My first was a SaaS platform for searching newspaper classified advertisements, including cars, jobs, and real estate. I then led the development of NLP solutions for extracting information from commercial construction documents, including building specifications and blueprints.
When starting NLP in a new area, I advise the following:
- Begin with a small but representative sample of the documents or text.
- Identify the target end-user personas and how extracted information improves their workflows.
- Specify the required information extractions and target accuracy metrics.
- Test several approaches, and benchmark them using speed and accuracy metrics (see the sketch after this list).
- Improve accuracy iteratively, especially when increasing the scale and breadth of documents.
- Expect to deliver data stewardship tools for addressing data quality and handling exceptions.
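As a simple illustration of the benchmarking step, the sketch below scores a pipeline's extracted entities against a small hand-labeled sample using precision, recall, and F1. The expected and predicted entities here are invented for the example:

```python
# Score predicted entities against a hand-labeled ground-truth set.
def score_entities(predicted: set[str], expected: set[str]) -> tuple[float, float, float]:
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

expected = {"Acme Corp", "March 2023", "$1.2M"}   # hand-labeled ground truth
predicted = {"Acme Corp", "$1.2M", "New York"}    # output from an NLP pipeline

print(score_entities(predicted, expected))
```

Running the same scoring against each candidate tool on the same labeled sample makes the accuracy trade-offs concrete before committing to a production pipeline.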
You may find that the NLP tools used to discover and experiment with new document types will aid in defining requirements. Then, expand the review of NLP technologies to include open source and commercial options, as building and supporting production-ready NLP data pipelines can get expensive. With LLMs in the news and gaining interest, underinvesting in NLP capabilities is one way to fall behind competitors. Fortunately, you can start with one of the open source tools introduced here and build your NLP data pipeline to fit your budget and requirements.