Data Extraction
Intelligent data extraction from documents with structured JSON output, field validation, and confidence scoring.
Overview
botsKYC automatically extracts structured data from unstructured documents using advanced AI models. The system analyzes document content, identifies key fields, validates extracted data, and returns results in a standardized JSON format with confidence scores.
Key Features
- Structured JSON Output - Consistent schema across all document types
- Field Validation - Automatic validation of extracted fields
- Confidence Scoring - Per-field and overall confidence metrics
- Multi-Page Support - Extract from PDFs and multi-page documents
- Smart Merging - Combine data from front/back of documents
- Error Detection - Identify missing or invalid fields
How It Works
Extraction Process
1. Document Analysis
The system first analyzes the document to:
- Identify document type
- Detect text regions
- Recognize layout structure
- Classify key-value pairs
2. Data Extraction
AI models extract structured data:
- Names and personal information
- Identification numbers
- Dates and expiry information
- Addresses and locations
- Financial data
- Business information
3. Validation
Each field is validated against rules:
- Format validation (dates, IDs, emails)
- Range validation (ages, amounts)
- Cross-field validation (consistency)
- Required field checking
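The required-field and cross-field rules above can be sketched roughly as follows. This is an illustrative sketch, not botsKYC's actual validator: the function name `check_identity`, the `REQUIRED_FIELDS` list, and the issue/expiry consistency rule are assumptions chosen to match the identity example on this page.

```python
from datetime import date

# Hypothetical list of required fields, as (section, field) pairs.
REQUIRED_FIELDS = [
    ("personalInfo", "fullName"),
    ("personalInfo", "dateOfBirth"),
    ("identification", "idNumber"),
]


def check_identity(extracted: dict) -> list[dict]:
    """Return error entries shaped like the items in `validation.errors`."""
    errors = []

    # Required field checking: each listed field must be present and non-empty.
    for section, field in REQUIRED_FIELDS:
        if not extracted.get(section, {}).get(field):
            errors.append(
                {"field": field, "message": "Required field is missing", "severity": "ERROR"}
            )

    # Cross-field validation (assumed rule): issue date must precede expiry date.
    ident = extracted.get("identification", {})
    issue, expiry = ident.get("issueDate"), ident.get("expiryDate")
    if issue and expiry and date.fromisoformat(issue) >= date.fromisoformat(expiry):
        errors.append(
            {"field": "expiryDate", "message": "Expiry date is not after issue date", "severity": "ERROR"}
        )

    return errors
```

An empty list means the checks in this sketch found nothing to report; it does not cover format or range validation.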
4. Confidence Scoring
Confidence scores are calculated for:
- Individual fields (0-100%)
- Document sections
- Overall document confidence
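How field scores roll up into section and overall scores is not specified here. As one possible illustration, the sketch below uses an unweighted mean at each level; the averaging rule and the name `aggregate_confidence` are assumptions, and the real system may weight fields differently.

```python
from statistics import mean


def aggregate_confidence(fields_by_section: dict[str, dict[str, float]]) -> dict:
    """Roll per-field scores (0-100) up into section and overall scores."""
    # Section score: unweighted mean of that section's field scores (assumption).
    by_section = {
        section: round(mean(scores.values()), 1)
        for section, scores in fields_by_section.items()
        if scores
    }
    # Overall score: unweighted mean over every field in the document (assumption).
    all_scores = [s for scores in fields_by_section.values() for s in scores.values()]
    overall = round(mean(all_scores), 1) if all_scores else 0.0
    return {"overall": overall, "bySection": by_section}
```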
Extracted Data Structure
Identity Documents
{
"documentType": "OMANG",
"extractedData": {
"personalInfo": {
"fullName": "John Doe",
"firstName": "John",
"surname": "Doe",
"dateOfBirth": "1990-01-15",
"gender": "Male",
"nationality": "Botswana"
},
"identification": {
"idNumber": "123456789",
"documentNumber": "OM123456789",
"issueDate": "2020-01-15",
"expiryDate": "2030-01-15"
},
"address": {
"district": "Gaborone",
"village": "Gaborone",
"ward": "5"
}
},
"confidence": {
"overall": 98.5,
"fields": {
"fullName": 99.0,
"idNumber": 98.8,
"dateOfBirth": 97.5
}
},
"validation": {
"status": "PASSED",
"errors": [],
"warnings": []
}
}
Address Documents
{
"documentType": "UTILITY_BILL",
"extractedData": {
"accountHolder": {
"name": "John Doe",
"accountNumber": "ACC-123456"
},
"address": {
"street": "123 Main Street",
"city": "Gaborone",
"postalCode": "00000",
"country": "Botswana"
},
"documentInfo": {
"issuer": "Water Utilities Corporation",
"issueDate": "2025-10-15",
"dueDate": "2025-11-15",
"billPeriod": "September 2025"
},
"charges": {
"currentCharges": 450.00,
"previousBalance": 0.00,
"totalDue": 450.00,
"currency": "BWP"
}
},
"confidence": {
"overall": 96.0,
"fields": {
"name": 98.5,
"address": 95.0,
"amount": 97.8
}
}
}
Income Documents
{
"documentType": "PAYSLIP",
"extractedData": {
"employee": {
"name": "John Doe",
"employeeId": "EMP-1001",
"position": "Software Engineer"
},
"employer": {
"name": "ABC Company Ltd",
"address": "Gaborone, Botswana"
},
"payment": {
"payPeriod": "October 2025",
"payDate": "2025-10-31",
"grossSalary": 15000.00,
"netSalary": 12500.00,
"currency": "BWP"
},
"deductions": {
"tax": 2000.00,
"pension": 500.00,
"total": 2500.00
},
"bankDetails": {
"accountNumber": "1234567890",
"bankName": "First National Bank"
}
},
"confidence": {
"overall": 97.8,
"fields": {
"grossSalary": 99.5,
"netSalary": 99.0,
"tax": 98.5
}
}
}
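Numeric fields in a payslip result can be sanity-checked against each other on the client side. The sketch below assumes that itemised deductions sum to `deductions.total` and that net salary equals gross salary minus total deductions. Both relations hold for the sample above but may not hold for every payslip layout, and `payslip_is_consistent` is a hypothetical helper name.

```python
import math


def payslip_is_consistent(extracted: dict, tolerance: float = 0.01) -> bool:
    """Check the arithmetic relations between payslip amounts."""
    payment = extracted["payment"]
    deductions = extracted["deductions"]

    # Sum every deduction line except the reported total itself.
    itemised = sum(value for key, value in deductions.items() if key != "total")

    totals_match = math.isclose(itemised, deductions["total"], abs_tol=tolerance)
    net_matches = math.isclose(
        payment["grossSalary"] - deductions["total"],
        payment["netSalary"],
        abs_tol=tolerance,
    )
    return totals_match and net_matches
```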
Business Entity Documents
{
"documentType": "BIN_CERTIFICATE",
"extractedData": {
"company": {
"name": "ABC Trading (Pty) Ltd",
"registrationNumber": "BIN123456789",
"businessType": "Private Company",
"registrationDate": "2010-05-15"
},
"directors": [
{
"name": "John Doe",
"idNumber": "123456789",
"nationality": "Botswana"
},
{
"name": "Jane Smith",
"idNumber": "987654321",
"nationality": "Botswana"
}
],
"registeredOffice": {
"street": "Plot 123, Main Mall",
"city": "Gaborone",
"postalAddress": "P.O. Box 12345"
},
"businessActivity": "Retail Trading"
},
"confidence": {
"overall": 95.5,
"fields": {
"companyName": 98.0,
"binNumber": 99.5,
"directors": 94.0
}
}
}
Field Validation
Validation Rules
ID Numbers (Omang)
{
"field": "idNumber",
"rules": [
"Must be 9 digits",
"Format: XXXXXXXXX",
"Valid checksum"
],
"example": "123456789"
}
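A client-side pre-check for the nine-digit format might look like the sketch below. It covers only the length and digit rules; the checksum algorithm is not described on this page, so it is deliberately left out. The name `is_valid_omang_format` is an assumption.

```python
import re

# Exactly nine ASCII digits, matching the "XXXXXXXXX" format rule.
OMANG_PATTERN = re.compile(r"[0-9]{9}")


def is_valid_omang_format(id_number: str) -> bool:
    """Format check only; the checksum rule is not implemented here."""
    return OMANG_PATTERN.fullmatch(id_number.strip()) is not None
```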
Dates
{
"field": "dateOfBirth",
"rules": [
"Valid date format (YYYY-MM-DD)",
"Not in future",
"Age >= 18 for adults"
],
"example": "1990-01-15"
}
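The three date rules can be approximated as follows. This is a sketch under stated assumptions: `check_date_of_birth` is a hypothetical helper, the messages are illustrative, and the `today` parameter exists only to make the function testable.

```python
import re
from datetime import date
from typing import Optional

# Strict YYYY-MM-DD shape; date.fromisoformat alone accepts other ISO forms
# on newer Python versions.
ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def check_date_of_birth(value: str, today: Optional[date] = None, min_age: int = 18) -> list[str]:
    """Return a list of problems found; an empty list means the date passed."""
    today = today or date.today()
    if not ISO_DATE.fullmatch(value):
        return ["Invalid date format, expected YYYY-MM-DD"]
    try:
        dob = date.fromisoformat(value)
    except ValueError:
        return ["Not a real calendar date"]
    if dob > today:
        return ["Date is in the future"]
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    age = today.year - dob.year - (0 if had_birthday else 1)
    if age < min_age:
        return [f"Age is below {min_age}"]
    return []
```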
Email
{
"field": "email",
"rules": [
"Valid email format",
"Contains @ symbol",
"Valid domain"
],
"example": "john.doe@example.com"
}
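For a quick client-side check of the email rules, a deliberately simple pattern is often enough. The sketch below is an assumption about what "valid format" and "valid domain" mean here: one `@`, a non-empty local part, and a domain with at least two dot-separated labels. It does not verify that the domain exists.

```python
import re

# Simplified pattern; it does not implement the full RFC 5322 grammar.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+")


def looks_like_email(value: str) -> bool:
    """Return True when the value has a plausible email shape."""
    return EMAIL_PATTERN.fullmatch(value) is not None
```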
Phone Numbers
{
"field": "phoneNumber",
"rules": [
"Valid Botswana format",
"Starts with +267 or 267",
"8 digits after country code"
],
"example": "+267 71234567"
}
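The phone rules translate into a short pattern check. In this sketch, spaces are ignored before matching, which is an assumption based on the spaced example value; `is_valid_bw_phone` is a hypothetical helper name.

```python
import re

# Optional "+", country code 267, then exactly 8 digits.
BW_PHONE_PATTERN = re.compile(r"\+?267[0-9]{8}")


def is_valid_bw_phone(value: str) -> bool:
    """Check the documented Botswana format after removing spaces."""
    compact = value.replace(" ", "")
    return BW_PHONE_PATTERN.fullmatch(compact) is not None
```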
Validation Status
{
"validation": {
"status": "PASSED" | "FAILED" | "WARNING",
"errors": [
{
"field": "dateOfBirth",
"message": "Date is in the future",
"severity": "ERROR"
}
],
"warnings": [
{
"field": "phoneNumber",
"message": "Phone number format unusual",
"severity": "WARNING"
}
]
}
}
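The `status` field takes one of the three values shown, so client code can branch on it. The sketch below is one possible way to consume the block; the follow-up actions (`accept`, `review`, `reject`) and both function names are assumptions, not part of the API.

```python
def next_action(validation: dict) -> str:
    """Map a `validation` block to a hypothetical follow-up action."""
    status = validation.get("status")
    if status == "PASSED":
        return "accept"
    if status == "WARNING":
        return "review"
    if status == "FAILED":
        return "reject"
    raise ValueError(f"Unexpected validation status: {status!r}")


def format_issues(validation: dict) -> list[str]:
    """Flatten errors and warnings into printable lines, errors first."""
    issues = validation.get("errors", []) + validation.get("warnings", [])
    return [f'{i["severity"]}: {i["field"]} - {i["message"]}' for i in issues]
```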
Confidence Scoring
Confidence Levels
| Score | Level | Description |
|---|---|---|
| 95-100% | Excellent | High confidence, no review needed |
| 85-94% | Good | Minor review recommended |
| 70-84% | Fair | Manual review recommended |
| < 70% | Low | Requires manual verification |
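The table maps directly onto a small lookup function. The table does not say how fractional scores between bands (for example 94.5) are classified, so this sketch assumes each band's lower bound is the cutoff; `confidence_level` is a hypothetical name.

```python
def confidence_level(score: float) -> str:
    """Return the level name for a confidence score between 0 and 100."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 95:
        return "Excellent"
    if score >= 85:
        return "Good"
    if score >= 70:
        return "Fair"
    return "Low"
```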
Factors Affecting Confidence
Positive Factors:
- Clear, high-resolution images
- Good lighting conditions
- Standard document format
- All text legible
- No damage or wear
Negative Factors:
- Poor image quality
- Glare or shadows
- Faded or damaged documents
- Non-standard formats
- Handwritten text
Confidence Response
{
"confidence": {
"overall": 96.5,
"bySection": {
"personalInfo": 98.0,
"identification": 95.5,
"address": 94.8
},
"byField": {
"fullName": 99.0,
"idNumber": 98.8,
"dateOfBirth": 97.5,
"address": 94.0
},
"quality": {
"imageQuality": 95.0,
"textClarity": 96.5,
"documentCondition": 97.0
}
}
}
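A common use of the `byField` scores is to decide which fields a person should double-check. The sketch below lists fields that fall under a threshold; the default of 95 and the name `fields_needing_review` are assumptions to adjust for your own review policy.

```python
def fields_needing_review(confidence: dict, threshold: float = 95.0) -> list[str]:
    """Return the names of fields scoring below the threshold, sorted by name."""
    by_field = confidence.get("byField", {})
    return sorted(name for name, score in by_field.items() if score < threshold)
```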
Smart Document Merging
Front/Back ID Cards
The system automatically merges data from both sides:
Front Side:
- Photo
- Name
- ID Number
- Date of Birth
Back Side:
- Address
- Issue/Expiry Dates
- District/Ward
- Signature
Merged Output:
{
"source": "MERGED",
"pages": ["front.jpg", "back.jpg"],
"extractedData": {
"name": "John Doe",
"idNumber": "123456789",
"dateOfBirth": "1990-01-15",
"address": "Gaborone, Block 6",
"issueDate": "2020-01-15",
"expiryDate": "2030-01-15"
},
"mergeQuality": 98.5
}
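The merge strategy itself is not documented here. As an illustration of the idea, the sketch below combines two per-side results and, where both sides report the same field, keeps the reading with the higher confidence. The input shape (field mapped to a `(value, confidence)` pair) and the name `merge_sides` are assumptions made for this example.

```python
def merge_sides(front: dict, back: dict) -> dict:
    """Merge per-side results; each side maps field -> (value, confidence)."""
    merged = {}
    for field in front.keys() | back.keys():
        candidates = [side[field] for side in (front, back) if field in side]
        # When both sides report a field, keep the higher-confidence reading.
        value, _ = max(candidates, key=lambda pair: pair[1])
        merged[field] = value
    return {"source": "MERGED", "extractedData": merged}
```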
Multi-Page PDFs
Extract and combine data across pages:
{
"totalPages": 3,
"extractedData": {
"page1": {
"bankStatement": {
"accountNumber": "1234567890",
"openingBalance": 50000.00
}
},
"page2": {
"transactions": [
{"date": "2025-10-01", "amount": 1500.00}
]
},
"page3": {
"closingBalance": 48500.00
}
},
"consolidated": {
"accountNumber": "1234567890",
"openingBalance": 50000.00,
"closingBalance": 48500.00,
"totalTransactions": 45
}
}
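If you need to rebuild a summary from the per-page data yourself, something like the following could work. It is a sketch based only on the field names in the sample above; `consolidate` is a hypothetical helper, and it takes the first occurrence of each header field.

```python
def consolidate(pages: dict) -> dict:
    """Flatten per-page results into one summary, similar to `consolidated`."""
    summary = {"totalTransactions": 0}
    for page in pages.values():
        # Statement header fields may be nested under "bankStatement".
        fields = {**page, **page.get("bankStatement", {})}
        for key in ("accountNumber", "openingBalance", "closingBalance"):
            if key in fields:
                summary.setdefault(key, fields[key])
        summary["totalTransactions"] += len(page.get("transactions", []))
    return summary
```

Note that this sketch counts only the transactions actually present in the page data, so for the abbreviated sample above, which lists a single transaction, it would report 1 rather than 45.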
Technology Stack
AI-Powered Processing
Features:
- Multi-modal understanding (text + images)
- Structured JSON output
- High accuracy on Botswana documents
- Fast processing times
Optimization Techniques
- Specialized Processing - Optimized for each document type
- Batch Processing - Multiple documents in single request
- Smart Caching - Reuse document classifications
- Parallel Processing - Concurrent page analysis
Error Handling
Extraction Errors
{
"status": "error",
"errorCode": 2001,
"message": "Failed to extract required fields",
"details": {
"missingFields": ["idNumber", "dateOfBirth"],
"lowConfidenceFields": ["address"],
"documentQuality": "POOR"
},
"suggestions": [
"Improve image quality",
"Ensure all text is visible",
"Remove glare or shadows"
]
}
Common Error Codes
| Code | Message | Resolution |
|---|---|---|
| 2001 | Missing required fields | Check document completeness |
| 2002 | Low extraction confidence | Improve image quality |
| 2003 | Invalid document format | Use supported formats |
| 2004 | Text not readable | Enhance image clarity |
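Client code can turn an error response into a readable message using the codes in the table. In the sketch below, the `RESOLUTIONS` mapping is copied from the table, while the function name `describe_error` and the "Contact support" fallback for unknown codes are assumptions.

```python
# Resolution hints copied from the error code table.
RESOLUTIONS = {
    2001: "Check document completeness",
    2002: "Improve image quality",
    2003: "Use supported formats",
    2004: "Enhance image clarity",
}


def describe_error(response: dict) -> str:
    """Build a one-line summary of an error response."""
    if response.get("status") != "error":
        return "No error"
    code = response.get("errorCode")
    # Fallback text for codes missing from the table is an assumption.
    hint = RESOLUTIONS.get(code, "Contact support")
    missing = response.get("details", {}).get("missingFields", [])
    suffix = f" (missing: {', '.join(missing)})" if missing else ""
    return f"[{code}] {response.get('message', 'Unknown error')}: {hint}{suffix}"
```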
Best Practices
Image Quality
- Use high-resolution images (min 1200x800px)
- Ensure good lighting (no glare/shadows)
- Capture entire document
- Keep document flat and straight
Document Preparation
- Remove protective covers
- Clean document surface
- Avoid reflective surfaces
- Use plain background
Processing
- Send front/back together
- Use batch for multiple documents
- Validate before submission
- Handle low confidence gracefully
Integration Examples
JavaScript
async function extractData(file) {
  const formData = new FormData();
  formData.append('documents', file);

  const response = await fetch('/api/v/kyc/verify/identity', {
    method: 'POST',
    body: formData
  });

  // Treat non-2xx responses as failures instead of reading them as results
  if (!response.ok) {
    throw new Error(`Extraction request failed with status ${response.status}`);
  }

  const result = await response.json();

  // Access extracted data
  console.log(result.extractedData);
  console.log(`Confidence: ${result.confidence.overall}%`);
  return result;
}
Python
import requests


def extract_document_data(file_path):
    url = 'https://api.botskyc.com/api/v/kyc/verify/identity'

    # Open the file in a context manager so the handle is closed after upload
    with open(file_path, 'rb') as document:
        # The 30-second timeout is an arbitrary example value
        response = requests.post(url, files={'documents': document}, timeout=30)

    # Raise on HTTP errors rather than reading an error body as a result
    response.raise_for_status()
    data = response.json()

    # Access extracted fields
    name = data['extractedData']['personalInfo']['fullName']
    id_number = data['extractedData']['identification']['idNumber']
    confidence = data['confidence']['overall']

    print(f"Name: {name}")
    print(f"ID: {id_number}")
    print(f"Confidence: {confidence}%")
    return data
cURL
# Extract identity data
curl -X POST https://api.botskyc.com/api/v/kyc/verify/identity \
  -F "documents=@omang.jpg" \
  | jq '.extractedData'

# Extract with confidence filtering
curl -X POST https://api.botskyc.com/api/v/kyc/verify/identity \
  -F "documents=@omang.jpg" \
  | jq 'select(.confidence.overall >= 90)'
Support
For additional assistance:
- Email: support@botskyc.com
- API Documentation: API Reference
- Getting Started: Quick Start Guide