by @skills-il
Process and extract data from scanned Israeli government forms using OCR. Supports Tabu (land registry), Tax Authority forms, Bituach Leumi documents, and other official Israeli paperwork. Use when user asks to OCR Hebrew documents, extract data from Israeli forms, "lesarek tofes", parse Tabu extract, read scanned tax form, or process Israeli government documents. Includes Hebrew OCR configuration, field extraction patterns, and RTL text handling. Do NOT use for handwritten Hebrew recognition (requires specialized models) or non-Israeli form processing.
npx skills-il add skills-il/localization --skill hebrew-ocr-forms| Form Type | Source | Key Identifiers | Common Fields |
|---|---|---|---|
| Nesach Tabu | Land Registry | "נסח טאבו", "לשכת רישום המקרקעין" | Gush, Chelka, Owner, Liens |
| Tofes 106 | Tax Authority | "טופס 106", "דו״ח שנתי למעביד" | Salary, Tax, Employer |
| Ishur Nikui | Tax Authority | "אישור ניכוי מס במקור" | Tax rate, Validity, TZ |
| Tofes 857 | Tax Authority | "טופס 857", "רווח הון" | Transaction, Gain, Tax |
| Ishur Zkauyot | Bituach Leumi | "אישור זכאויות", "ביטוח לאומי" | Benefit type, Amount |
| Tofes 100 | Bituach Leumi | "טופס 100", "דין וחשבון" | Employees, Wages |
| Rishayon Rechev | Vehicle Licensing | "רישיון רכב" | Plate, Owner, Expiry |
See scripts/preprocess_image.py for the full preprocessing pipeline. Key steps:
See scripts/extract_form_fields.py for the full extraction pipeline.
Tesseract configuration for Hebrew forms:
config = (
'--oem 1 ' # LSTM neural net (best for Hebrew)
'--psm 6 ' # Assume uniform block of text
'-l heb+eng ' # Hebrew + English (forms have both)
'-c preserve_interword_spaces=1' # Keep spacing for field alignment
)Tabu Extract (Nesach Tabu) key fields:
Tax Form (Tofes 106) key fields:
import unicodedata
def normalize_bidi_text(text):
"""Normalize bidirectional text from Hebrew OCR output."""
lines = text.split('\n')
normalized = []
for line in lines:
# Strip bidi control characters
clean = ''.join(
c for c in line
if unicodedata.category(c) != 'Cf'
)
clean = ' '.join(clean.split())
if clean:
normalized.append(clean)
return '\n'.join(normalized)After extraction, validate key fields:
User says: "Extract data from this scanned Tabu document" Result: Preprocess image, run Hebrew OCR, identify as Nesach Tabu, extract gush/chelka/owner/rights fields, validate TZ, return structured JSON.
User says: "I have 50 scanned Tofes 106 forms -- extract salary and tax data" Result: Set up batch pipeline -- preprocess each image, OCR with Hebrew+English, extract Tofes 106 fields, validate, output to CSV/JSON with confidence scores.
User says: "The OCR isn't reading the Hebrew text correctly" Result: Diagnose preprocessing -- check image resolution (recommend 300 DPI minimum), verify deskewing, adjust binarization threshold, try different PSM modes, check Hebrew language pack installation.
scripts/preprocess_image.py — Prepare scanned Israeli form images for OCR: grayscale conversion, deskewing rotated scans, adaptive binarization for uneven lighting, morphological noise removal, optional CLAHE contrast enhancement, and border removal. Run: python scripts/preprocess_image.py --helpscripts/extract_form_fields.py — Run Tesseract Hebrew OCR on preprocessed form images and extract structured fields by form type. Supports auto-detection of Tabu, Tofes 106, and other Israeli government forms. Outputs JSON with extracted fields and Israeli ID validation. Run: python scripts/extract_form_fields.py --helpreferences/israeli-form-types.md — Detailed catalog of Israeli government form types (Tabu/land registry, Tax Authority forms, Bituach Leumi documents) with field descriptions, regex extraction patterns, ID validation rules, date/currency formats, and OCR tips per form layout. Consult when identifying an unknown form or building field extraction logic for a specific document type.Cause: Hebrew traineddata not installed
Solution: Install with sudo apt-get install tesseract-ocr-heb (Ubuntu) or brew install tesseract-lang (macOS). Verify with tesseract --list-langs.
Cause: Bidirectional text ordering issue
Solution: Use normalize_bidi_text() function. Ensure Tesseract is using --oem 1 (LSTM mode). Check that the image is not mirrored.
Cause: Poor scan quality, stamps/signatures overlapping text, colored backgrounds Solution: Increase preprocessing -- apply stronger denoising, crop to specific form regions, use ROI-based extraction for known form layouts.
Supported Agents
Trust Score
This skill can execute scripts and commands on your system.
by @skills-il
Schedule meetings, deployments, and events respecting Shabbat, Israeli holidays (chagim), and Hebrew calendar constraints. Use when user asks to schedule around Shabbat, "zmanim", check Israeli holidays, plan around chagim, set Israeli business hours, or needs Hebrew calendar-aware scheduling logic. Includes halachic times (zmanim) via HebCal API, full Israeli holiday calendar, and Israeli business hour conventions. Do NOT use for religious halachic rulings (consult a rabbi) or diaspora 2-day holiday scheduling.
by @skills-il
Write and edit professional content in Hebrew including marketing copy, UX text, articles, emails, and social media posts. Use when user asks to write in Hebrew, "ktov b'ivrit", create Hebrew marketing content, edit Hebrew text, write Hebrew UX copy, or optimize Hebrew content for SEO. Covers grammar rules, formal vs informal register, gendered language handling, and Hebrew SEO best practices. Do NOT use for Hebrew NLP/ML tasks (use hebrew-nlp-toolkit) or translation (use a translation skill).
by @skills-il
Implement right-to-left (RTL) layouts for Hebrew web and mobile applications. Use when user asks about RTL layout, Hebrew text direction, bidirectional (bidi) text, Hebrew CSS, "right to left", or needs to build Hebrew UI. Covers CSS logical properties, Tailwind RTL, React/Vue RTL, Hebrew typography, and font selection. Do NOT use for Arabic RTL (similar but different typography) unless user explicitly asks for shared RTL patterns.