
Data Extraction

Intelligent data extraction from documents with structured JSON output, field validation, and confidence scoring.

Overview

botsKYC automatically extracts structured data from unstructured documents using advanced AI models. The system analyzes document content, identifies key fields, validates extracted data, and returns results in a standardized JSON format with confidence scores.

Key Features

Structured JSON Output - Consistent schema across all document types
Field Validation - Automatic validation of extracted fields
Confidence Scoring - Per-field and overall confidence metrics
Multi-Page Support - Extract from PDFs and multi-page documents
Smart Merging - Combine data from front/back of documents
Error Detection - Identify missing or invalid fields

How It Works

Extraction Process

1. Document Analysis

The system first analyzes the document to:

  • Identify document type
  • Detect text regions
  • Recognize layout structure
  • Classify key-value pairs

2. Data Extraction

AI models extract structured data:

  • Names and personal information
  • Identification numbers
  • Dates and expiry information
  • Addresses and locations
  • Financial data
  • Business information

3. Validation

Each field is validated against rules:

  • Format validation (dates, IDs, emails)
  • Range validation (ages, amounts)
  • Cross-field validation (consistency)
  • Required field checking

4. Confidence Scoring

Confidence scores are calculated for:

  • Individual fields (0-100%)
  • Document sections
  • Overall document confidence

Extracted Data Structure

Identity Documents

{
  "documentType": "OMANG",
  "extractedData": {
    "personalInfo": {
      "fullName": "John Doe",
      "firstName": "John",
      "surname": "Doe",
      "dateOfBirth": "1990-01-15",
      "gender": "Male",
      "nationality": "Botswana"
    },
    "identification": {
      "idNumber": "123456789",
      "documentNumber": "OM123456789",
      "issueDate": "2020-01-15",
      "expiryDate": "2030-01-15"
    },
    "address": {
      "district": "Gaborone",
      "village": "Gaborone",
      "ward": "5"
    }
  },
  "confidence": {
    "overall": 98.5,
    "fields": {
      "fullName": 99.1,
      "idNumber": 98.8,
      "dateOfBirth": 97.5
    }
  },
  "validation": {
    "status": "PASSED",
    "errors": [],
    "warnings": []
  }
}
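
Because the response nests fields several levels deep, a small accessor can make client code more tolerant of missing sections. The following is a minimal sketch, not part of the botsKYC API; the helper name get_field and the dotted-path convention are assumptions made for illustration.

```python
from typing import Any


def get_field(result: dict, path: str, default: Any = None) -> Any:
    """Walk a dotted path such as 'extractedData.personalInfo.fullName'.

    Returns `default` as soon as any segment of the path is missing.
    (Hypothetical helper, not provided by the API.)
    """
    node: Any = result
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# Trimmed copy of the identity response shown above.
sample = {
    "documentType": "OMANG",
    "extractedData": {
        "personalInfo": {"fullName": "John Doe"},
        "identification": {"idNumber": "123456789"},
    },
}

full_name = get_field(sample, "extractedData.personalInfo.fullName")
ward = get_field(sample, "extractedData.address.ward", default="unknown")
```

Here the lookup for the ward falls back to the default because the trimmed sample has no address section.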

Address Documents

{
  "documentType": "UTILITY_BILL",
  "extractedData": {
    "accountHolder": {
      "name": "John Doe",
      "accountNumber": "ACC-123456"
    },
    "address": {
      "street": "123 Main Street",
      "city": "Gaborone",
      "postalCode": "00000",
      "country": "Botswana"
    },
    "documentInfo": {
      "issuer": "Water Utilities Corporation",
      "issueDate": "2025-10-15",
      "dueDate": "2025-11-15",
      "billPeriod": "September 2025"
    },
    "charges": {
      "currentCharges": 450.00,
      "previousBalance": 0.00,
      "totalDue": 450.00,
      "currency": "BWP"
    }
  },
  "confidence": {
    "overall": 96.1,
    "fields": {
      "name": 98.5,
      "address": 95.0,
      "amount": 97.8
    }
  }
}

Income Documents

{
  "documentType": "PAYSLIP",
  "extractedData": {
    "employee": {
      "name": "John Doe",
      "employeeId": "EMP-1001",
      "position": "Software Engineer"
    },
    "employer": {
      "name": "ABC Company Ltd",
      "address": "Gaborone, Botswana"
    },
    "payment": {
      "payPeriod": "October 2025",
      "payDate": "2025-10-31",
      "grossSalary": 15000.00,
      "netSalary": 12500.00,
      "currency": "BWP"
    },
    "deductions": {
      "tax": 2000.00,
      "pension": 500.00,
      "total": 2500.00
    },
    "bankDetails": {
      "accountNumber": "1234567890",
      "bankName": "First National Bank"
    }
  },
  "confidence": {
    "overall": 97.8,
    "fields": {
      "grossSalary": 99.5,
      "netSalary": 99.1,
      "tax": 98.5
    }
  }
}
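
The payslip figures lend themselves to the cross-field consistency checks mentioned under Validation. The sketch below shows one way a client might re-check the arithmetic on its side; the function name, the tolerance, and the rule that net pay equals gross pay minus total deductions are assumptions for illustration, not documented botsKYC behaviour.

```python
from decimal import Decimal


def check_payslip_totals(extracted: dict, tolerance: str = "0.01") -> list:
    """Return a list of human-readable inconsistencies (empty if none)."""
    payment = extracted["payment"]
    deductions = extracted["deductions"]
    tol = Decimal(tolerance)

    # Convert via str() so float artefacts do not leak into the comparison.
    gross = Decimal(str(payment["grossSalary"]))
    net = Decimal(str(payment["netSalary"]))
    total = Decimal(str(deductions["total"]))
    itemised = sum(
        (Decimal(str(v)) for k, v in deductions.items() if k != "total"),
        Decimal("0"),
    )

    problems = []
    if abs(itemised - total) > tol:
        problems.append(f"Itemised deductions {itemised} do not add up to total {total}")
    if abs(gross - total - net) > tol:
        problems.append(f"Gross {gross} minus deductions {total} does not equal net {net}")
    return problems


# Values taken from the payslip sample above.
payslip = {
    "payment": {"grossSalary": 15000.00, "netSalary": 12500.00},
    "deductions": {"tax": 2000.00, "pension": 500.00, "total": 2500.00},
}
issues = check_payslip_totals(payslip)
```

For the sample payslip the list comes back empty, since 2000 + 500 = 2500 and 15000 - 2500 = 12500.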

Business Entity Documents

{
  "documentType": "BIN_CERTIFICATE",
  "extractedData": {
    "company": {
      "name": "ABC Trading (Pty) Ltd",
      "registrationNumber": "BIN123456789",
      "businessType": "Private Company",
      "registrationDate": "2010-05-15"
    },
    "directors": [
      {
        "name": "John Doe",
        "idNumber": "123456789",
        "nationality": "Botswana"
      },
      {
        "name": "Jane Smith",
        "idNumber": "987654321",
        "nationality": "Botswana"
      }
    ],
    "registeredOffice": {
      "street": "Plot 123, Main Mall",
      "city": "Gaborone",
      "postalAddress": "P.O. Box 12345"
    },
    "businessActivity": "Retail Trading"
  },
  "confidence": {
    "overall": 95.5,
    "fields": {
      "companyName": 98.0,
      "binNumber": 99.5,
      "directors": 94.1
    }
  }
}

Field Validation

Validation Rules

ID Numbers (Omang)

{
  "field": "idNumber",
  "rules": [
    "Must be 9 digits",
    "Format: XXXXXXXXX",
    "Valid checksum"
  ],
  "example": "123456789"
}
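
A client can pre-check the 9-digit format before uploading or after receiving a result. This is a minimal sketch with a hypothetical function name; it covers only the length and digit rules, because the checksum algorithm is not described here and is deliberately left out rather than guessed.

```python
import re

# Exactly nine ASCII digits, matching the "XXXXXXXXX" format rule above.
OMANG_PATTERN = re.compile(r"[0-9]{9}")


def is_omang_format(id_number: str) -> bool:
    """Check the length/digit rule only; the checksum rule is NOT verified."""
    return OMANG_PATTERN.fullmatch(id_number.strip()) is not None


looks_valid = is_omang_format("123456789")
```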

Dates

{
  "field": "dateOfBirth",
  "rules": [
    "Valid date format (YYYY-MM-DD)",
    "Not in future",
    "Age >= 18 for adults"
  ],
  "example": "1990-01-15"
}
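
The three date rules can be mirrored on the client side with the standard library. The sketch below is illustrative only: the function name, the error strings, and the injectable today parameter (useful for testing) are assumptions, and the service's own rule ordering may differ.

```python
import re
from datetime import date
from typing import Optional

ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def validate_date_of_birth(value: str, today: Optional[date] = None, min_age: int = 18) -> list:
    """Return a list of rule violations for a date of birth (empty if valid)."""
    today = today or date.today()

    # Rule 1: strict YYYY-MM-DD, then let the date constructor reject
    # impossible values such as month 13.
    if not ISO_DATE.fullmatch(value):
        return ["Invalid date format, expected YYYY-MM-DD"]
    try:
        dob = date.fromisoformat(value)
    except ValueError:
        return ["Invalid calendar date"]

    # Rule 2: not in the future.
    if dob > today:
        return ["Date is in the future"]

    # Rule 3: minimum age, subtracting one year if the birthday
    # has not yet occurred this year.
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    if age < min_age:
        return [f"Age {age} is below the minimum of {min_age}"]
    return []


errors = validate_date_of_birth("1990-01-15", today=date(2025, 10, 31))
```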

Email

{
  "field": "email",
  "rules": [
    "Valid email format",
    "Contains @ symbol",
    "Valid domain"
  ],
  "example": "john.doe@example.com"
}

Phone Numbers

{
  "field": "phoneNumber",
  "rules": [
    "Valid Botswana format",
    "Starts with +267 or 267",
    "8 digits after country code"
  ],
  "example": "+267 71234567"
}
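
The phone rules translate fairly directly into a regular expression. In this sketch the function name is hypothetical, and ignoring spaces and hyphens before matching is an assumption made so that the documented example ("+267 71234567") passes; the service may normalise input differently.

```python
import re

# Optional "+", country code 267, then exactly eight digits.
PHONE_PATTERN = re.compile(r"\+?267[0-9]{8}")


def is_botswana_phone(value: str) -> bool:
    """Check the prefix and digit-count rules after removing separators."""
    compact = re.sub(r"[\s-]", "", value)  # assumption: separators are ignored
    return PHONE_PATTERN.fullmatch(compact) is not None


ok = is_botswana_phone("+267 71234567")
```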

Validation Status

{
  "validation": {
    "status": "PASSED" | "FAILED" | "WARNING",
    "errors": [
      {
        "field": "dateOfBirth",
        "message": "Date is in the future",
        "severity": "ERROR"
      }
    ],
    "warnings": [
      {
        "field": "phoneNumber",
        "message": "Phone number format unusual",
        "severity": "WARNING"
      }
    ]
  }
}
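
A client typically needs to turn the validation object into a decision. The following sketch shows one possible policy: reject on any error or a FAILED status, and flag for review on any warning. The function name and the policy itself are assumptions, not behaviour prescribed by the API.

```python
def summarize_validation(validation: dict) -> dict:
    """Condense a validation object into an accept/review decision."""
    status = validation.get("status")
    errors = validation.get("errors", [])
    warnings = validation.get("warnings", [])
    return {
        # Assumed policy: any error, or a FAILED status, blocks acceptance.
        "accepted": status != "FAILED" and not errors,
        # Assumed policy: warnings do not block, but call for a second look.
        "needs_review": bool(warnings) or status == "WARNING",
        "messages": [
            f'{item["severity"]}: {item["field"]}: {item["message"]}'
            for item in errors + warnings
        ],
    }


# Example shaped like the validation object above, with status FAILED.
summary = summarize_validation({
    "status": "FAILED",
    "errors": [
        {"field": "dateOfBirth", "message": "Date is in the future", "severity": "ERROR"}
    ],
    "warnings": [
        {"field": "phoneNumber", "message": "Phone number format unusual", "severity": "WARNING"}
    ],
})
```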

Confidence Scoring

Confidence Levels

Score      Level      Description
95-100%    Excellent  High confidence, no review needed
85-94%     Good       Minor review recommended
70-84%     Fair       Manual review recommended
< 70%      Low        Requires manual verification
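
The bands above can be applied in client code with a simple threshold lookup. This is a sketch under one stated assumption: fractional scores that fall between bands (for example 94.8) are assigned by lower bound, so anything from 85 up to but not including 95 counts as Good. The function name is hypothetical.

```python
def confidence_level(score: float) -> str:
    """Map a 0-100 confidence score to the level names in the table above."""
    if not 0 <= score <= 100:
        raise ValueError(f"Confidence must be between 0 and 100, got {score}")
    if score >= 95:
        return "Excellent"
    if score >= 85:
        return "Good"
    if score >= 70:
        return "Fair"
    return "Low"


level = confidence_level(98.5)
```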

Factors Affecting Confidence

Positive Factors:

  • Clear, high-resolution images
  • Good lighting conditions
  • Standard document format
  • All text legible
  • No damage or wear

Negative Factors:

  • Poor image quality
  • Glare or shadows
  • Faded or damaged documents
  • Non-standard formats
  • Handwritten text

Confidence Response

{
  "confidence": {
    "overall": 96.5,
    "bySection": {
      "personalInfo": 98.1,
      "identification": 95.5,
      "address": 94.8
    },
    "byField": {
      "fullName": 99.1,
      "idNumber": 98.8,
      "dateOfBirth": 97.5,
      "address": 94.1
    },
    "quality": {
      "imageQuality": 95.0,
      "textClarity": 96.5,
      "documentCondition": 97.0
    }
  }
}
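
To route only the doubtful parts of a document to a reviewer, a client can filter the score maps by a threshold. The sketch below works on any of the name-to-score maps (bySection, byField, quality); the function name and the default threshold of 95 are assumptions chosen for illustration.

```python
def below_threshold(scores: dict, threshold: float = 95.0) -> list:
    """Return the names whose score is under the threshold, sorted by name."""
    return sorted(name for name, score in scores.items() if score < threshold)


# A subset of the scores from the confidence response above.
confidence = {
    "bySection": {"identification": 95.5, "address": 94.8},
    "byField": {"idNumber": 98.8, "dateOfBirth": 97.5},
}

sections_to_review = below_threshold(confidence["bySection"])
fields_to_review = below_threshold(confidence["byField"])
```

With these inputs only the address section falls below 95, so it is the only entry flagged.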

Smart Document Merging

Front/Back ID Cards

The system automatically merges data from both sides:

Front Side:

  • Photo
  • Name
  • ID Number
  • Date of Birth

Back Side:

  • Address
  • Issue/Expiry Dates
  • District/Ward
  • Signature

Merged Output:

{
  "source": "MERGED",
  "pages": ["front.jpg", "back.jpg"],
  "extractedData": {
    "name": "John Doe",
    "idNumber": "123456789",
    "dateOfBirth": "1990-01-15",
    "address": "Gaborone, Block 6",
    "issueDate": "2020-01-15",
    "expiryDate": "2030-01-15"
  },
  "mergeQuality": 98.5
}
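
The merge happens on the server, and its exact rules are not documented here. As a rough mental model only, the sketch below combines two flat dictionaries, keeps the front-side value when both sides disagree, and records the disagreement. The function name, the conflict rule, and the conflicts key are all assumptions and do not appear in the API response.

```python
EMPTY = (None, "")


def merge_sides(front: dict, back: dict) -> dict:
    """Combine front and back fields, preferring the front on disagreement."""
    merged = {k: v for k, v in front.items() if v not in EMPTY}
    conflicts = []
    for key, value in back.items():
        if value in EMPTY:
            continue  # nothing useful extracted for this field
        if key in merged and merged[key] != value:
            conflicts.append(key)  # keep the front value, flag for review
        else:
            merged[key] = value
    return {"source": "MERGED", "extractedData": merged, "conflicts": conflicts}


front = {"name": "John Doe", "idNumber": "123456789", "dateOfBirth": "1990-01-15"}
back = {"address": "Gaborone, Block 6", "issueDate": "2020-01-15", "expiryDate": "2030-01-15"}
result = merge_sides(front, back)
```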

Multi-Page PDFs

Extract and combine data across pages:

{
  "totalPages": 3,
  "extractedData": {
    "page1": {
      "bankStatement": {
        "accountNumber": "1234567890",
        "openingBalance": 50000.00
      }
    },
    "page2": {
      "transactions": [
        {"date": "2025-10-01", "amount": 1500.00}
      ]
    },
    "page3": {
      "closingBalance": 48500.00
    }
  },
  "consolidated": {
    "accountNumber": "1234567890",
    "openingBalance": 50000.00,
    "closingBalance": 48500.00,
    "totalTransactions": 45
  }
}
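
To illustrate how per-page data could roll up into the consolidated block, the sketch below walks the pages in order, takes account details from the first page that has them, takes the closing balance from the last page that has one, and counts transactions. The function name and these precedence rules are assumptions; the service's actual consolidation logic is not specified here.

```python
def consolidate_pages(extracted: dict) -> dict:
    """Roll per-page bank statement data up into a single summary."""
    summary = {}
    total_transactions = 0

    # Visit "page1", "page2", ... numerically so "page10" follows "page9".
    for name in sorted(extracted, key=lambda n: int(n[len("page"):])):
        page = extracted[name]
        statement = page.get("bankStatement", {})
        for key in ("accountNumber", "openingBalance"):
            if key in statement:
                summary.setdefault(key, statement[key])  # first occurrence wins
        if "closingBalance" in page:
            summary["closingBalance"] = page["closingBalance"]  # last occurrence wins
        total_transactions += len(page.get("transactions", []))

    summary["totalTransactions"] = total_transactions
    return summary


pages = {
    "page1": {"bankStatement": {"accountNumber": "1234567890", "openingBalance": 50000.00}},
    "page2": {"transactions": [{"date": "2025-10-01", "amount": 1500.00}]},
    "page3": {"closingBalance": 48500.00},
}
consolidated = consolidate_pages(pages)
```

Run on the abbreviated input above, the count is 1 because only one transaction is listed; the 45 in the sample response reflects the full statement.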

Technology Stack

AI-Powered Processing

Features:

  • Multi-modal understanding (text + images)
  • Structured JSON output
  • High accuracy on Botswana documents
  • Fast processing times

Optimization Techniques

  1. Specialized Processing - Optimized for each document type
  2. Batch Processing - Multiple documents in single request
  3. Smart Caching - Reuse document classifications
  4. Parallel Processing - Concurrent page analysis

Error Handling

Extraction Errors

{
  "status": "error",
  "errorCode": 2001,
  "message": "Failed to extract required fields",
  "details": {
    "missingFields": ["idNumber", "dateOfBirth"],
    "lowConfidenceFields": ["address"],
    "documentQuality": "POOR"
  },
  "suggestions": [
    "Improve image quality",
    "Ensure all text is visible",
    "Remove glare or shadows"
  ]
}

Common Error Codes

Code  Message                    Resolution
2001  Missing required fields    Check document completeness
2002  Low extraction confidence  Improve image quality
2003  Invalid document format    Use supported formats
2004  Text not readable          Enhance image clarity
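
On the client side, the error codes above can be mapped to their resolutions and surfaced as an exception. This is a minimal sketch: the ExtractionError class and the raise_for_extraction_error helper are hypothetical names, and the code assumes error responses carry "status": "error" as in the example above.

```python
# Resolutions copied from the error code table above.
RESOLUTIONS = {
    2001: "Check document completeness",
    2002: "Improve image quality",
    2003: "Use supported formats",
    2004: "Enhance image clarity",
}


class ExtractionError(Exception):
    """Raised when the response body reports an extraction error."""

    def __init__(self, code, message, resolution, missing_fields):
        super().__init__(f"[{code}] {message}. Suggested resolution: {resolution}")
        self.code = code
        self.resolution = resolution
        self.missing_fields = missing_fields


def raise_for_extraction_error(result: dict) -> dict:
    """Return the result unchanged, or raise if it describes an error."""
    if result.get("status") != "error":
        return result
    code = result.get("errorCode")
    details = result.get("details", {})
    raise ExtractionError(
        code,
        result.get("message", "Unknown error"),
        RESOLUTIONS.get(code, "See the error message"),
        details.get("missingFields", []),
    )
```

Calling the helper right after parsing the JSON keeps error handling in one place, so the rest of the client can assume extractedData is present.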

Best Practices

Image Quality

  • Use high-resolution images (min 1200x800px)
  • Ensure good lighting (no glare/shadows)
  • Capture entire document
  • Keep document flat and straight

Document Preparation

  • Remove protective covers
  • Clean document surface
  • Avoid reflective surfaces
  • Use plain background

Processing

  • Send front/back together
  • Use batch for multiple documents
  • Validate before submission
  • Handle low confidence gracefully

Integration Examples

JavaScript

async function extractData(file) {
  const formData = new FormData();
  formData.append('documents', file);

  const response = await fetch('/api/v/kyc/verify/identity', {
    method: 'POST',
    body: formData
  });

  // Fail fast on HTTP errors before trying to parse the body
  if (!response.ok) {
    throw new Error(`Extraction request failed: ${response.status}`);
  }

  const result = await response.json();

  // Access extracted data
  console.log(result.extractedData);
  console.log(`Confidence: ${result.confidence.overall}%`);

  return result;
}

Python

import requests

def extract_document_data(file_path):
    url = 'https://api.botskyc.com/api/v/kyc/verify/identity'

    # Open the file in a context manager so the handle is closed after upload
    with open(file_path, 'rb') as document:
        response = requests.post(url, files={'documents': document}, timeout=30)

    # Raise on HTTP errors instead of parsing an error body as a result
    response.raise_for_status()
    data = response.json()

    # Access extracted fields
    name = data['extractedData']['personalInfo']['fullName']
    id_number = data['extractedData']['identification']['idNumber']
    confidence = data['confidence']['overall']

    print(f"Name: {name}")
    print(f"ID: {id_number}")
    print(f"Confidence: {confidence}%")

    return data

cURL

# Extract identity data
curl -X POST https://api.botskyc.com/api/v/kyc/verify/identity \
-F "documents=@omang.jpg" \
| jq '.extractedData'

# Extract with confidence filtering
curl -X POST https://api.botskyc.com/api/v/kyc/verify/identity \
-F "documents=@omang.jpg" \
| jq 'select(.confidence.overall >= 90)'

Support

For additional assistance: