Can Today’s NLP Technology Really Understand Legal Contracts?

An Interview with ContractAI’s Principal Data Scientist and Head of R&D – Dr. Shilpa Balakrishnan

By Sathya Ramnath

For those who are unfamiliar with NLP, can you help us define NLP and some of the common uses of the technology today?

Natural Language Processing (NLP) is a branch of Machine Learning (ML) and Artificial Intelligence (AI) that deals with the use of computers to understand, interpret, and process human language, including speech and text. Quite often, NLP is used as an umbrella term covering NLU (Natural Language Understanding), also called NLI (Natural Language Interpretation), and NLG (Natural Language Generation). NLU/NLI techniques are used to extract and convey the intent or meaning of the language. Humans process information arriving in many forms, including words, images, and signs; with NLP, a system should likewise be able to read, derive, understand, and infer human language in a way that adds value.

There are many examples of NLP applications in our daily life: smart assistants such as Siri and Alexa, which recognize speech patterns through voice recognition, infer meaning, and then provide a helpful response. Another example of NLP at work is the related-keyword suggestions on search engines: the system recognizes the appropriate word or phrase cues and gives suggestions by interpreting their meaning. In the last few years, the use of Transformer architectures has led to rapid advancement in NLP across tasks including information retrieval and question answering (Q&A), where we can get an answer span for a question asked against any document (text), as well as querying of structured data (e.g., tables).

There are so many applications and devices on the market using NLP technologies. Can you break down the different types of NLP technologies in the market?

There is a wide variety of open-source NLP technologies, which are free and flexible for developers to use. These tools provide NLP libraries that can help businesses analyze unstructured information through metadata extraction, text analysis, classification, and sentiment analysis. A few examples of these libraries include Stanford CoreNLP (e.g., NER for dates and periods), spaCy (e.g., grammatical parsers), the Natural Language Toolkit (NLTK), Hugging Face (e.g., Transformers), FLAIR (e.g., model stacking), and the OpenGrm Thrax Grammar Compiler (FSTs).

You have been working on developing and improving the natural language processing capability of ContractAI since its inception. Can you talk about how NLP can be incorporated in Contract Lifecycle Management (CLM)?

Historical and third-party contracts are a gold mine of valuable information. The most basic use of NLP in CLM is metadata extraction. To extract and model valuable insights from existing contracts, a robust, state-of-the-art NLP system needs to be in place. Instead of manually typing in all the key pieces of contract metadata, such as company names and addresses, effective and expiration dates, and signatory names, NLP does it for you. You can then perform in-depth analysis or use the information for post-contract management.
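As a simplified illustration of the metadata-extraction idea, consider a toy date extractor (the sample contract text here is hypothetical, and a production CLM system would rely on trained NER models rather than hand-written patterns):

```python
import re

# Month-name date pattern, e.g. "March 1, 2021". This toy pattern only
# shows the input/output shape; real systems use trained NER models.
DATE_RE = re.compile(
    r"\b(January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+(\d{1,2}),\s+(\d{4})\b"
)

def extract_dates(text):
    """Return every month-name date found in the contract text."""
    return [f"{month} {day}, {year}" for month, day, year in DATE_RE.findall(text)]

# Hypothetical contract snippet, for illustration only.
sample = ("This Agreement is effective as of March 1, 2021 and, unless "
          "terminated earlier, expires on February 28, 2024.")
print(extract_dates(sample))  # ['March 1, 2021', 'February 28, 2024']
```

The extracted values can then be stored as structured metadata fields (effective date, expiration date) for downstream analysis.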

A more advanced but less explored use of NLP in CLM is its ability to read, extract, and decipher all contractual language at the clause and subclause level. This helps the tool understand all of the past positions held by the contracting stakeholders and provides valuable insights for enriching the contract language. However, this level of sophistication requires substantial training and optimized learning to comprehend specific contract language.

How effective are current NLP technologies on the market in understanding contracting language?

The syntax and knowledge needed for deep inference can only be achieved if the NLP model is trained and optimized for the procurement and legal domain. However, most open-source NLP tools are based on general-domain, press-style language (i.e., Wikipedia, news corpora). One of the biggest challenges of NLP is understanding language with variable context: sentences that have different interpretations depending on certain domain-specific information, as in business contracts.

For example, the two clauses below would be regarded as similar by most open-source similarity models; however, they have significantly different legal implications at a contractual level.

Sentence A: During the Term and for a period of two years thereafter, or for a period of seven years from the date of creation of the Records (whichever is longer) the Supplier shall keep full, true and accurate Records to show compliance with its obligations under this Agreement together with any other records that are required by any professional rules of any regulatory body which apply to the activities of the Supplier or as may from time to time be agreed in writing between the Company and the Supplier.

Sentence B: During the Term and for a period of two years thereafter, or for a period of seven years from the date of creation of the Records (whichever is longer) the Supplier shall keep full, true and accurate Records to show compliance with its obligations under this Agreement together with such other records as may from time to time be agreed in writing between the Company and the Supplier.

How are you approaching that problem? What is your team doing to better understand contracting language?

As an extension of the previous example, let us consider the similarity problem. ContractAI’s baseline similarity model is built using open-source text embeddings and vector distance estimates. However, these are severely limited because the underlying models are pretrained on general-domain language. To overcome this limitation, App Orchid’s data science team has built models, whether ML, statistical, or algebraic, on top of the open-source models, guided by procurement- and legal-specific language. ContractAI uses the right tools, in the right combinations, in the right order to build proprietary learning algorithms that understand contract language.
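The embeddings-plus-vector-distance baseline can be sketched minimally as follows, using toy bag-of-words count vectors in place of pretrained embeddings (an assumption purely for illustration):

```python
import math
from collections import Counter

def bow_vector(text):
    """Toy 'embedding': a bag-of-words count vector over lowercase tokens."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (0.0 to 1.0)."""
    dot = sum(a[tok] * b[tok] for tok in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Condensed versions of the record-keeping clauses above: the wording
# overlap is high, so a surface-level model scores them as very similar
# even though the legal obligations differ.
clause_a = ("the Supplier shall keep full true and accurate Records together "
            "with any other records required by any regulatory body")
clause_b = ("the Supplier shall keep full true and accurate Records together "
            "with such other records as agreed in writing")
score = cosine_similarity(bow_vector(clause_a), bow_vector(clause_b))
```

Because the surface wording overlaps heavily, a purely distributional score stays high; a domain-adapted model pushes the score down to reflect the difference in legal obligation.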

The NLP models are trained or fine-tuned on a large corpus of contracts, allowing us to build statistical and weighted ML/AI models. We have compiled a legal dictionary for contracting language and are constantly building and expanding ContractAI’s legal contract ontology/knowledge graphs. This gives us the ability to interpret natural language queries against both free-form textual and structured data. In combination, we have created a system that leverages language syntax, semantics, and knowledge graphs, resulting in a more accurate, interpretable language representation within the procurement space.
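To make the knowledge-graph idea concrete, here is a toy triple store and pattern query (the contract IDs, predicates, and values are hypothetical; ContractAI’s actual ontology schema is not public):

```python
# Toy triple store standing in for a contract ontology/knowledge graph.
# All identifiers and values below are invented for illustration.
TRIPLES = [
    ("MSA-2021-004", "has_party", "Acme Corp."),
    ("MSA-2021-004", "has_party", "Globex Inc."),
    ("MSA-2021-004", "expiration_date", "2024-02-28"),
    ("MSA-2021-004", "governed_by", "New York law"),
]

def query(subject=None, predicate=None, obj=None):
    """Return all triples matching the given pattern (None = wildcard)."""
    return [
        (s, p, o) for s, p, o in TRIPLES
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# A natural-language question like "Who are the parties to MSA-2021-004?"
# would be mapped to a structured pattern query such as:
parties = [o for _, _, o in query("MSA-2021-004", "has_party")]
print(parties)  # ['Acme Corp.', 'Globex Inc.']
```

The interesting work in a real system is the mapping from free-form questions to such structured queries; the graph lookup itself is the easy part.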

Can you give us an example of how your improvements to NLP make a difference in understanding contracting language?

From the text-embedding and vector-distance viewpoint, take the two record-keeping clauses quoted above (Sentences A and B).

The open-source models gave us a normalized similarity score of 0.92. ContractAI recognizes that the clauses differ more significantly from a legal perspective, and our proprietary model gave a normalized similarity score of 0.75.

From the metadata extraction viewpoint, we have seen numerous instances where commonly used open-source models, such as those from Hugging Face or spaCy, struggle to accurately extract even simple contract metadata such as dates and addresses. ContractAI’s custom models, by contrast, use proprietary algorithms that consistently show greater accuracy. Hence, ContractAI is not limited to extracting only a few data points, as seen with other solutions on the market; the solution can be configured to pull any kind of information according to customers’ requirements.

What should companies look for when evaluating the NLP capabilities of CLM products?

With terms such as AI, ML, and NLP being used as catch-all phrases, it has become crucial to understand how much of the automation and text analytics is driven by computational models, as opposed to resource- and time-intensive human intervention at every stage. When evaluating NLP capabilities, companies should look at three major parameters:

  • Does the product automatically read and extract any data in a contract regardless of format and automatically identify clauses?
  • Does the platform leverage contract specific ontologies (or knowledge graphs) to comprehend procurement legalese?
  • Does the product reflect the use of NLU/NLI tools to understand, interpret, and identify contractual language based on context?


Dr. Shilpa Balakrishnan

Principal Data Scientist and Head of R&D, App Orchid

© 2021 - App Orchid. All rights reserved