
N-gram Feature Implementation Summary

Overview

The text transformer now supports configurable n-gram matching, allowing it to recognize multi-word phrases as single entities rather than breaking them into individual words.

Key Features

1. Configurable N-gram Size

2. Priority-Based Matching

3. Examples of Behavior

With maxNgramSize=1:

With maxNgramSize=2:

With maxNgramSize=3:

Implementation Details

Algorithm:

  1. Generate all possible n-grams up to maxNgramSize
  2. Filter out n-grams containing only stop words
  3. Search for all n-grams in parallel
  4. Sort results by n-gram size (descending)
  5. Apply matches in priority order, marking used tokens
  6. Return matches in original order
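
The six steps above can be sketched roughly as follows. This is a minimal illustration, not the transformer's actual code: the names generateNgrams, isOnlyStopWords, matchNgrams, STOP_WORDS, known and lookup are all hypothetical, tokenization is assumed to be a plain whitespace split, and ties between n-grams of equal size are assumed to be broken by position in the text.

```javascript
// Hypothetical stop-word list; the real list is not given in this summary.
const STOP_WORDS = new Set(["a", "an", "the", "was", "is", "of", "in"]);

// Step 1: every contiguous run of 1..maxNgramSize tokens, with its position.
function generateNgrams(tokens, maxNgramSize) {
  const ngrams = [];
  for (let size = 1; size <= maxNgramSize; size++) {
    for (let start = 0; start + size <= tokens.length; start++) {
      const words = tokens.slice(start, start + size);
      ngrams.push({ start, size, words, text: words.join(" ") });
    }
  }
  return ngrams;
}

// Step 2: an n-gram is dropped only when every one of its words is a stop word.
function isOnlyStopWords(ngram) {
  return ngram.words.every((w) => STOP_WORDS.has(w.toLowerCase()));
}

// `lookup` is an assumed async function: it resolves to a match value for a
// known phrase and to null otherwise.
async function matchNgrams(text, lookup, maxNgramSize) {
  const tokens = text.split(/\s+/).filter(Boolean);
  const candidates = generateNgrams(tokens, maxNgramSize)
    .filter((n) => !isOnlyStopWords(n));

  // Step 3: start every lookup at once and wait for all of them.
  const results = await Promise.all(
    candidates.map(async (n) => ({ ...n, match: await lookup(n.text) }))
  );

  // Step 4: longest n-grams first. Array.prototype.sort is stable, so
  // equal-size hits keep their left-to-right generation order.
  const hits = results
    .filter((r) => r.match !== null)
    .sort((a, b) => b.size - a.size);

  // Step 5: accept a hit only if none of its tokens is already claimed.
  const used = new Array(tokens.length).fill(false);
  const accepted = [];
  for (const hit of hits) {
    const end = hit.start + hit.size;
    if (used.slice(hit.start, end).some(Boolean)) continue;
    used.fill(true, hit.start, end);
    accepted.push(hit);
  }

  // Step 6: restore the order in which the phrases appear in the text.
  return accepted.sort((a, b) => a.start - b.start);
}

// Usage with a toy in-memory lookup table (hypothetical data).
const known = new Map([
  ["Barack Obama", "person"],
  ["Barack", "name"],
  ["Obama", "name"],
  ["president", "title"],
]);
const lookup = async (phrase) => known.get(phrase) ?? null;

matchNgrams("Barack Obama was president", lookup, 3)
  .then((matches) => console.log(matches.map((m) => m.text)));
// logs the phrases "Barack Obama" and "president"
```

In this sketch, the two-word hit claims both of its tokens before the single-word hits for "Barack" and "Obama" are considered, so the shorter matches are skipped. Calling the same function with a maximum size of 1 would return the three single-word matches instead.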

Performance:

Test Results

Mock Tests: N-gram-specific tests pass

E2E Tests: Integration with the real API works

Feature Demo: Demonstrates clear differences in behavior with different maxNgramSize values

Usage

In HTML Demo:

In Code:

const result = await transformer.transform("Barack Obama was president", {
  maxNgramSize: 3,  // Consider up to 3-word phrases
  // other options...
});

Benefits

  1. Recognizes known multi-word phrases more accurately
  2. Reduces ambiguity by preferring longer, more specific phrases
  3. Lets callers adjust maxNgramSize to balance precision and recall
  4. Maintains backward compatibility (default behavior unchanged)