Analyzer Guide

Learn how to detect PII entities in your text

The Analyzer is the first step in the anonymization process. It scans your text and identifies personally identifiable information (PII) like names, emails, phone numbers, and more.


How the Analyzer Works

The Analyzer uses multiple detection methods to identify PII:

Pattern Matching

Regular expressions detect structured data like email addresses, phone numbers, credit cards, and IBANs with high accuracy.
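As a rough illustration of pattern matching, the sketch below scans text for email and IPv4 addresses with Python's `re` module. The patterns and the `find_patterns` helper are simplified assumptions made for this example, not the Analyzer's actual rules.

```python
import re

# Simplified, illustrative patterns (assumptions, not the Analyzer's real rules).
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def find_patterns(text):
    """Return (entity_type, matched_text, start, end) tuples sorted by position."""
    hits = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((entity_type, match.group(), match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[2])

for hit in find_patterns("Contact john@email.com from 192.168.1.1"):
    print(hit)
```

Patterns this loose will also match strings that are not real addresses (for example `999.999.999.999`), which is one reason pattern matches are usually combined with further validation.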

Machine Learning (NER)

Named Entity Recognition models identify context-dependent entities like person names, organizations, and locations using spaCy, Stanza, and Transformers.

Checksum Validation

Credit cards, IBANs, and other financial identifiers are validated with checksum algorithms (Luhn, MOD-97) to reduce false positives.
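To show what checksum validation involves, here is a minimal sketch of the two algorithms named above. The function names `luhn_valid` and `iban_valid` are chosen for this example; how the Analyzer implements these checks internally is not documented here.

```python
def luhn_valid(number):
    """Luhn check for card-like numbers; separators such as '-' or ' ' are ignored."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

def iban_valid(iban):
    """MOD-97 check: move the first four characters to the end,
    map letters to 10..35, and require a remainder of 1."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1

print(luhn_valid("4111-1111-1111-1111"))
print(iban_valid("GB82 WEST 1234 5698 7654 32"))
```

A candidate that matches a credit card pattern but fails the Luhn check can be discarded, which is how the checksum step cuts false positives. These checks only confirm the number is well-formed, not that the account exists.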


Using the Analyzer

Step 1: Enter Your Text

  1. Navigate to the Anonymize page
  2. Paste or type your text in the input area
  3. The interface shows a character count and token estimate

Step 2: Select Entity Types

Choose which types of PII to detect:

We support 256 entity types organized into 10 categories, including:

Category | Covers | Entity types | Example
Personal | Names, emails, phone numbers, dates of birth | PERSON, EMAIL_ADDRESS, PHONE_NUMBER | John Doe, john@email.com
Financial | Credit cards, bank accounts, IBAN, crypto wallets | CREDIT_CARD, IBAN_CODE, SWIFT_CODE | 4111-1111-1111-1111
Location | Addresses, cities, countries, coordinates | LOCATION, ADDRESS, COORDINATES | 123 Main St, New York
Government | SSN, passport numbers, driver licenses, national IDs | SSN, PASSPORT, DRIVER_LICENSE | 123-45-6789
Technical | IP addresses, MAC addresses, device IDs | IP_ADDRESS, MAC_ADDRESS | 192.168.1.1

Instead of selecting entities manually, use Presets to quickly apply common entity configurations like "GDPR Compliance" or "Financial Data".

Step 3: Select Language

Choose the language of your text for optimal detection accuracy:

  • Auto-detect - Let the system determine the language
  • Specific language - Select from 27 supported languages

Language Selection Matters

Selecting the correct language significantly improves detection accuracy, especially for person names and locations.

Step 4: Run Analysis

  1. Click the Analyze button
  2. Wait for the analysis to complete (typically 1-3 seconds)
  3. Review the detected entities in the results panel

Understanding Results

After analysis, each detected entity shows:

Example result: entity type PERSON, text "John Doe", followed by its confidence score and its position in characters.

Result Fields

  • Entity Type - The category of PII detected (PERSON, EMAIL_ADDRESS, etc.)
  • Text - The actual text that was identified as PII
  • Confidence Score - How certain the system is (0-100%)
  • Position - Start and end character positions
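If you work with results programmatically, the four fields can be modeled as a small record. The sketch below is an assumption for illustration: the `DetectedEntity` class, its field names, and the exclusive end position are not taken from the product's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DetectedEntity:
    entity_type: str   # category of PII, e.g. "PERSON"
    text: str          # the text identified as PII
    confidence: float  # 0.0-1.0, displayed as 0-100%
    start: int         # index of the first character
    end: int           # index one past the last character (assumed exclusive)

source = "My name is John Doe."
entity = DetectedEntity("PERSON", "John Doe", 0.85, 11, 19)

# The position fields should always slice back to the reported text.
assert source[entity.start:entity.end] == entity.text
print(f"{entity.entity_type} {entity.text!r} {entity.confidence:.0%} "
      f"[{entity.start}:{entity.end}]")
```

Checking that the position slices back to the reported text is a cheap sanity test when you pass results between tools.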

Confidence Threshold

Adjust the confidence threshold to control sensitivity:

Threshold | Effect | Best For
Low | More entities detected, more false positives | Maximum coverage, manual review
Default | Balanced detection and accuracy | General use
High | Fewer entities, higher confidence | Automated processing
Very High | Only very confident matches | Minimal intervention
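The effect of the threshold can be sketched as a simple filter over the results. The numeric thresholds and sample scores below are hypothetical values chosen for illustration; the guide does not state the actual numbers behind Low, Default, High, and Very High.

```python
def filter_by_confidence(results, threshold):
    """Keep results whose confidence is at least `threshold` (0.0-1.0)."""
    return [r for r in results if r["confidence"] >= threshold]

# Hypothetical results with made-up confidence scores.
results = [
    {"entity_type": "PERSON", "text": "John Doe", "confidence": 0.85},
    {"entity_type": "LOCATION", "text": "Main", "confidence": 0.40},
    {"entity_type": "EMAIL_ADDRESS", "text": "john@email.com", "confidence": 1.00},
]

low = filter_by_confidence(results, 0.3)    # broad coverage, needs manual review
high = filter_by_confidence(results, 0.8)   # fewer, more certain matches

print(len(low), len(high))
```

Raising the threshold drops the uncertain LOCATION match while keeping the two high-confidence results, which is the trade-off the table describes.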

Selecting Results

After analysis, you can refine which entities to anonymize:

Select/Deselect All

  • Use the checkbox in the header to select or deselect all results
  • Only selected entities will be anonymized

Individual Selection

  • Click individual checkboxes to include/exclude specific entities
  • Useful when the analyzer detects false positives
  • Useful when you want to keep certain information visible

Filter by Type

  • Click on an entity type badge to filter results by that type
  • Quickly select/deselect all entities of a specific type
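The selection behavior described above can be pictured as a `selected` flag on each result. This is a minimal sketch under that assumption; the dictionary layout and helper names are hypothetical and do not reflect how the interface stores its state.

```python
def set_selected_for_type(results, entity_type, selected):
    """Return copies of the results with `selected` set for one entity type."""
    return [
        {**r, "selected": selected} if r["entity_type"] == entity_type else dict(r)
        for r in results
    ]

def entities_to_anonymize(results):
    """Only selected entities are passed on to anonymization."""
    return [r["text"] for r in results if r["selected"]]

results = [
    {"entity_type": "PERSON", "text": "John Doe", "selected": True},
    {"entity_type": "LOCATION", "text": "New York", "selected": True},
    {"entity_type": "LOCATION", "text": "Main St", "selected": True},
]

# Deselect every LOCATION result, as clicking the type badge would.
without_locations = set_selected_for_type(results, "LOCATION", False)
print(entities_to_anonymize(without_locations))
```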

Review results before anonymizing. The analyzer may occasionally detect false positives, especially for names that are also common words.


Token Costs

Analysis operations consume tokens based on:

Cost = 2 + 1.0 × text_k + 0.2 × entities_enabled + 0.1 × entities_found

Final = ceil(Cost × 0.5)

Where:

  • text_k = text length in thousands of characters
  • entities_enabled = number of entity types enabled
  • entities_found = number of entities detected
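The formula can be checked with a few lines of Python. This sketch assumes that text_k is the text length divided by 1,000; the function name `analysis_token_cost` is made up for this example.

```python
import math

def analysis_token_cost(text_length, entities_enabled, entities_found):
    """Token cost of one analysis, following the formula above."""
    text_k = text_length / 1000  # assumption: length in thousands of characters
    cost = 2 + 1.0 * text_k + 0.2 * entities_enabled + 0.1 * entities_found
    return math.ceil(cost * 0.5)

# 1,000 characters, 5 types enabled, 5 entities found:
# Cost = 2 + 1.0 + 1.0 + 0.5 = 4.5, Final = ceil(2.25) = 3
print(analysis_token_cost(1000, 5, 5))
```

Because of the fixed base of 2 and the final rounding up, very short texts still cost at least a couple of tokens, and enabling fewer entity types lowers the cost only in small steps.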

Typical Cost

Text Length | Entities | Typical Cost
100 characters | 3 types, 2 found | 2 tokens
1,000 characters | 5 types, 5 found | 3 tokens
5,000 characters | 10 types, 15 found | 6 tokens
10,000 characters | 15 types, 30 found | 10 tokens

See the Token System documentation for complete pricing details.


Best Practices

  • Select only the entity types you need to reduce costs and false positives
  • Use language-specific presets for better accuracy in non-English text
  • Review results before anonymizing, especially for names and locations
  • Use higher confidence thresholds for automated processing
  • Process text in reasonable chunks (under 10,000 characters) for best performance
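If you prepare long documents before pasting them in, chunking can be sketched as follows. The `chunk_text` helper is an assumption for illustration: it splits at whitespace where it can, so words are not cut in half, and records each chunk's offset so detected positions can be mapped back to the original text.

```python
def chunk_text(text, max_chars=10_000):
    """Split text into (offset, chunk) pairs of at most max_chars characters,
    preferring to break just after a space or newline."""
    if max_chars < 1:
        raise ValueError("max_chars must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the last whitespace inside the window, if there is one.
            split = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if split > start:
                end = split + 1  # keep the whitespace with the left-hand chunk
        chunks.append((start, text[start:end]))
        start = end
    return chunks

for offset, chunk in chunk_text("aaa bbb ccc", max_chars=5):
    print(offset, repr(chunk))
```

Splitting at whitespace can still separate a multi-word entity such as a full name or street address across two chunks, so review results near chunk boundaries.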

Troubleshooting

Entity not detected?

  • Ensure the entity type is enabled in your selection
  • Try lowering the confidence threshold
  • Check that the correct language is selected
  • Verify the text format matches expected patterns

Too many false positives?

  • Increase the confidence threshold
  • Deselect broad entity types like LOCATION
  • Use entity-specific presets instead of selecting all

Analysis taking too long?

  • Break large texts into smaller chunks
  • Reduce the number of entity types selected
  • Use presets to avoid loading unused detection models

Last Updated: March 2026