
Analyzer Guide

Learn how to detect PII entities in your text

The Analyzer is the first step in the anonymization process. It scans your text and identifies personally identifiable information (PII) like names, emails, phone numbers, and more.


How the Analyzer Works

The Analyzer uses multiple detection methods to identify PII:

Pattern Matching

Regular expressions detect structured data like email addresses, phone numbers, credit cards, and IBANs with high accuracy.
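To make the idea concrete, here is a minimal sketch of pattern-based detection in Python. The two patterns and the find_patterns helper are illustrative assumptions, not the Analyzer's actual rules, and are deliberately simpler than production recognizers.

```python
import re

# Simplified, illustrative patterns (assumptions, not the Analyzer's real rules).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b")

def find_patterns(text):
    """Return (entity_type, matched_text, start, end) tuples sorted by start."""
    hits = []
    for entity_type, pattern in (("EMAIL_ADDRESS", EMAIL_RE), ("CREDIT_CARD", CARD_RE)):
        for match in pattern.finditer(text):
            hits.append((entity_type, match.group(), match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[2])

sample = "Contact john@email.com, card 4111-1111-1111-1111."
hits = find_patterns(sample)
for hit in hits:
    print(hit)
```

A regex match only says the text has the right shape; it does not confirm that the value is real, which is why structured identifiers are usually validated further.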

Machine Learning (NER)

Named Entity Recognition models identify context-dependent entities like person names, organizations, and locations using spaCy, Stanza, and Transformers.

Checksum Validation

Credit cards, IBANs, and other financial identifiers are validated with checksum algorithms (Luhn for card numbers, MOD-97 for IBANs), which reduces false positives.
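The two checksums named above are standard algorithms, and a short Python sketch shows how each rejects a candidate that merely looks right. The function names luhn_valid and iban_valid are hypothetical, and the IBAN check skips country-specific length rules for brevity.

```python
def luhn_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0

def iban_valid(iban: str) -> bool:
    """MOD-97 check: move the first four characters to the end, map letters
    A..Z to 10..35, and require the resulting integer mod 97 to equal 1."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1

print(luhn_valid("4111-1111-1111-1111"))
print(luhn_valid("4111-1111-1111-1112"))
print(iban_valid("GB82 WEST 1234 5698 7654 32"))
```

Running the checksum after a pattern match means that a random 16-digit number, such as an order ID, is unlikely to be reported as a credit card.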


Using the Analyzer

Step 1: Enter Your Text

  1. Navigate to the Anonymize page
  2. Paste or type your text in the input area
  3. The interface shows a character count and token estimate

Step 2: Select Entity Types

Choose which types of PII to detect:

Category   | Entity Types                        | Example
-----------|-------------------------------------|-------------------------
Personal   | PERSON, EMAIL_ADDRESS, PHONE_NUMBER | John Doe, john@email.com
Financial  | CREDIT_CARD, IBAN_CODE, SWIFT_CODE  | 4111-1111-1111-1111
Location   | LOCATION, ADDRESS, COORDINATES      | 123 Main St, New York
Government | SSN, PASSPORT, DRIVER_LICENSE       | 123-45-6789
Technical  | IP_ADDRESS, MAC_ADDRESS             | 192.168.1.1

Tip: Use Presets

Instead of selecting entities manually, use Presets to quickly apply common entity configurations like "GDPR Compliance" or "Financial Data".

Step 3: Select Language

Choose the language of your text for optimal detection accuracy:

  • Auto-detect - Let the system determine the language
  • Specific language - Select from 27 supported languages

Language Selection Matters

Selecting the correct language significantly improves detection accuracy, especially for person names and locations.

Step 4: Run Analysis

  1. Click the Analyze button
  2. Wait for the analysis to complete (typically 1-3 seconds)
  3. Review the detected entities in the results panel

Understanding Results

After analysis, each detected entity shows:

PERSON | John Doe | 95% confidence

Position: characters 0-8

Result Fields

  • Entity Type - The category of PII detected (PERSON, EMAIL_ADDRESS, etc.)
  • Text - The actual text that was identified as PII
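  • Confidence Score - How certain the system is, shown as a percentage (0-100%); the confidence threshold uses the same value on a 0-1 scale, so 0.5 means 50%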
  • Position - Start and end character positions
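As an illustration of how these four fields fit together, the sketch below models one result as a small Python record. The DetectedEntity class and its field names are hypothetical, and it assumes the end position is exclusive, which is consistent with "John Doe" (8 characters) being reported at characters 0-8.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DetectedEntity:
    """Hypothetical container for one analyzer result."""
    entity_type: str
    text: str
    score: float  # 0-1 scale; 0.95 is displayed as 95%
    start: int    # index of the first character
    end: int      # index one past the last character (assumed exclusive)

source = "John Doe lives in New York."
person = DetectedEntity("PERSON", "John Doe", 0.95, 0, 8)

# With an exclusive end, slicing the source by position recovers the text.
recovered = source[person.start:person.end]
summary = f"{person.entity_type} {person.text} {person.score:.0%} confidence"
print(summary)
print(recovered)
```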

Confidence Threshold

Adjust the confidence threshold to control sensitivity:

Threshold       | Effect                                       | Best For
----------------|----------------------------------------------|--------------------------------
0.3 (Low)       | More entities detected, more false positives | Maximum coverage, manual review
0.5 (Default)   | Balanced detection and accuracy              | General use
0.7 (High)      | Fewer entities, higher confidence            | Automated processing
0.9 (Very High) | Only very confident matches                  | Minimal intervention
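The effect of each threshold can be sketched as a simple filter over scored results. The sample results and the apply_threshold helper are made up for illustration, and the inclusive comparison (score at or above the threshold) is an assumption.

```python
# Hypothetical analyzer output; scores use the same 0-1 scale as the threshold.
results = [
    {"type": "EMAIL_ADDRESS", "text": "john@email.com", "score": 1.00},
    {"type": "PERSON", "text": "John Doe", "score": 0.95},
    {"type": "PHONE_NUMBER", "text": "555-0100", "score": 0.60},
    {"type": "LOCATION", "text": "Main", "score": 0.45},
]

def apply_threshold(results, threshold=0.5):
    """Keep results scoring at or above the threshold (inclusive by assumption)."""
    return [r for r in results if r["score"] >= threshold]

for threshold in (0.3, 0.5, 0.7, 0.9):
    kept = apply_threshold(results, threshold)
    print(threshold, [r["type"] for r in kept])
```

In this sample, raising the threshold from 0.3 to 0.7 drops the low-scoring LOCATION and PHONE_NUMBER hits while keeping the confident matches.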

Selecting Results

After analysis, you can refine which entities to anonymize:

Select/Deselect All

  • Use the checkbox in the header to select or deselect all results
  • Only selected entities will be anonymized

Individual Selection

  • Click individual checkboxes to include/exclude specific entities
  • Useful when the analyzer detects false positives
  • Useful when you want to keep certain information visible

Filter by Type

  • Click on an entity type badge to filter results by that type
  • Quickly select/deselect all entities of a specific type
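The selection behavior described above can be modeled as a set of selected result indices. This is only a sketch of the logic; the select_all and deselect_type helpers and the sample results are hypothetical.

```python
# Hypothetical results list as it might appear in the results panel.
results = [
    {"type": "PERSON", "text": "John Doe"},
    {"type": "LOCATION", "text": "New York"},
    {"type": "PERSON", "text": "Jane Roe"},
]

def select_all(results):
    """Header checkbox: select every result by index."""
    return set(range(len(results)))

def deselect_type(selected, results, entity_type):
    """Type badge: drop every selected result of the given entity type."""
    return {i for i in selected if results[i]["type"] != entity_type}

selected = select_all(results)
selected = deselect_type(selected, results, "LOCATION")

# Only the entities that remain selected would be anonymized.
to_anonymize = [results[i]["text"] for i in sorted(selected)]
print(to_anonymize)
```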

Pro Tip

Review results before anonymizing. The analyzer may occasionally detect false positives, especially for names that are also common words.


Token Costs

Analysis operations consume tokens based on:

Cost = 2 + 1.0 × text_k + 0.2 × entities_enabled + 0.1 × entities_found

Final = ceil(Cost × 0.5)

Where:

  • text_k = text length in thousands of characters
  • entities_enabled = number of entity types selected
  • entities_found = number of entities detected
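The formula can be checked with a few lines of Python. The analysis_cost function name is hypothetical. The sketch works in thousandths of a token with integer arithmetic so that the final ceiling is not affected by floating-point rounding.

```python
def analysis_cost(text_length: int, entities_enabled: int, entities_found: int) -> int:
    """Final = ceil((2 + 1.0*text_k + 0.2*enabled + 0.1*found) * 0.5).

    All terms are scaled by 1000, so text_k * 1000 is simply the
    character count and every coefficient becomes an integer.
    """
    cost_milli = 2000 + text_length + 200 * entities_enabled + 100 * entities_found
    # Multiplying by 0.5 and rounding up equals ceiling division by 2000.
    return -(-cost_milli // 2000)

# 1,000 characters, 5 types enabled, 5 entities found:
# Cost = 2 + 1.0 + 1.0 + 0.5 = 4.5, Final = ceil(2.25) = 3
print(analysis_cost(1000, 5, 5))
```

Because the text term grows by one unit per thousand characters, long inputs dominate the cost, while each extra enabled entity type adds only 0.2 before the final halving.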

Cost Examples

Text Length       | Entities           | Typical Cost
------------------|--------------------|-------------
100 characters    | 3 types, 2 found   | 2 tokens
1,000 characters  | 5 types, 5 found   | 3 tokens
5,000 characters  | 10 types, 15 found | 6 tokens
10,000 characters | 15 types, 30 found | 9 tokens

See the Token System documentation for complete pricing details.


Best Practices

  • Select only the entity types you need - reduces costs and false positives
  • Use language-specific presets for better accuracy in non-English text
  • Review results before anonymizing, especially for names and locations
  • Use higher confidence thresholds for automated processing
  • Process text in reasonable chunks (under 10,000 characters) for best performance
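For the chunking advice, one possible approach is to split at whitespace so that a name or number is not cut in half at a chunk boundary. The chunk_text helper below is a hypothetical sketch, not part of the product. It also returns each chunk's offset so that entity positions can be mapped back to the original text.

```python
def chunk_text(text, max_chars=10_000):
    """Yield (offset, chunk) pairs where each chunk has at most max_chars characters.

    Chunks end just after the last space or newline in the window when one
    exists; otherwise the window is cut at max_chars.
    """
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            split = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if split > start:
                end = split + 1  # keep the whitespace with the current chunk
        yield start, text[start:end]
        start = end

# Small window to make the behavior visible.
demo = list(chunk_text("alpha beta gamma delta", max_chars=12))
print(demo)
```

An entity found at position p inside a chunk sits at offset + p in the original text, and concatenating the chunks in order reproduces the input exactly.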

Troubleshooting

Entity not detected?

  • Ensure the entity type is enabled in your selection
  • Try lowering the confidence threshold
  • Check that the correct language is selected
  • Verify the text format matches expected patterns

Too many false positives?

  • Increase the confidence threshold
  • Deselect broad entity types like LOCATION
  • Use entity-specific presets instead of selecting all entity types

Analysis taking too long?

  • Break large texts into smaller chunks
  • Reduce the number of entity types selected
  • Use presets to avoid loading unused detection models


Last Updated: February 2026