# Ingest Overview
The purpose of the Ingest system is to digitize the source material.

The Ingestion pipeline performs the following steps for each input file:
1. Extract and normalize text
   a. Identify the region in the input file that contains text
   b. Run Handwritten Text Recognition (HTR) to convert the region to text
   c. Recognize the language of the text
   d. Provide for human correction and feed the corrections back into the model fine-tuning queue
2. Extract photographs
   a. Identify the region in the input file that contains a photograph

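
The steps above can be sketched as a small pipeline skeleton. This is a minimal sketch, not the actual implementation: the class names, the `Box` region format, and the four stage callables are all hypothetical, chosen so that any of the candidate technologies could be plugged in behind the same interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical region format: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class IngestResult:
    text: str = ""
    language: str = ""
    text_regions: List[Box] = field(default_factory=list)
    photo_regions: List[Box] = field(default_factory=list)
    # Step 1d: every result starts out waiting for human correction.
    needs_review: bool = True


@dataclass
class IngestPipeline:
    # Each stage is injected, so the HTR engine or region detector can be swapped.
    find_text_regions: Callable[[bytes], List[Box]]
    run_htr: Callable[[bytes, Box], str]
    detect_language: Callable[[str], str]
    find_photo_regions: Callable[[bytes], List[Box]]

    def ingest(self, image: bytes) -> IngestResult:
        result = IngestResult()
        # 1a. Identify the regions that contain text.
        result.text_regions = self.find_text_regions(image)
        # 1b. Run HTR on each text region and join the output.
        result.text = "\n".join(
            self.run_htr(image, box) for box in result.text_regions
        )
        # 1c. Recognize the language of the recognized text.
        result.language = self.detect_language(result.text)
        # 2a. Identify the regions that contain a photograph.
        result.photo_regions = self.find_photo_regions(image)
        return result


# Stub stages stand in for real models, for illustration only.
pipeline = IngestPipeline(
    find_text_regions=lambda image: [(0, 0, 100, 50)],
    run_htr=lambda image, box: "Liebe Mutter",
    detect_language=lambda text: "de",
    find_photo_regions=lambda image: [],
)
result = pipeline.ingest(b"fake image bytes")
```

Keeping each stage behind a plain callable means the choice of technology only affects the function passed in, not the pipeline itself.
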
# Technologies

## Evaluation criteria
Below is a prioritized list of the evaluation criteria. Aligning on these criteria is the most important step before choosing a tech stack.

1. Accuracy: how accurately does the model recognize *handwritten* letters in the target languages (English, German, and Hebrew)?
2. Tuning: how easy is it to tune the model based on human feedback?
3. Price: how much does each run or tuning pass cost?
4. Simplicity: how much work is it to integrate with the model? For example, does it align with the tech stack of the other systems?

Hugging Face leaderboard of vision-language model accuracy: https://huggingface.co/spaces/opencompass/open_vlm_leaderboard

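
Because the criteria are prioritized, the letter grades given to each candidate can be combined into a single weighted score. The following is a minimal sketch under stated assumptions: the GPA-style point values and the 4/3/2/1 weights are hypothetical and not part of the agreed criteria, and a candidate with any missing or unknown grade is left unscored rather than guessed.

```python
from typing import Dict, Optional

# Hypothetical GPA-style mapping from letter grades to points.
GRADE_POINTS = {
    "A+": 4.3, "A": 4.0, "A-": 3.7,
    "B+": 3.3, "B": 3.0, "B-": 2.7,
    "C+": 2.3, "C": 2.0, "C-": 1.7,
    "D": 1.0, "F": 0.0,
}

# Hypothetical weights that mirror the priority order of the criteria.
WEIGHTS = {"accuracy": 4, "tuning": 3, "price": 2, "simplicity": 1}


def weighted_score(grades: Dict[str, Optional[str]]) -> Optional[float]:
    """Return the weighted mean grade, or None if any criterion is ungraded."""
    total = 0.0
    for criterion, weight in WEIGHTS.items():
        grade = grades.get(criterion)
        if grade not in GRADE_POINTS:
            # Missing, empty, or "?" grades make the candidate incomparable.
            return None
        total += weight * GRADE_POINTS[grade]
    return total / sum(WEIGHTS.values())


# A fully graded candidate gets a score; a partially graded one does not.
complete = weighted_score(
    {"accuracy": "A+", "tuning": "A", "price": "B", "simplicity": "B"}
)
partial = weighted_score(
    {"accuracy": "?", "tuning": "F", "price": "B+", "simplicity": "B"}
)
```

Returning `None` for incomplete rows keeps an unevaluated criterion from silently counting as a zero.
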
## (A) Amazon Textract

URL: https://aws.amazon.com/pm/textract/

1. Accuracy: ?
2. Tuning: F.
3. Price: B+. ~$15 per million pages. URL: https://aws.amazon.com/textract/pricing/
4. Simplicity: B.

## (B) DocuPanda

URL: https://www.docupanda.io

1. Accuracy:
2. Tuning:
3. Price:
4. Simplicity:

## (C) Transkribus (Proposed)

URL: https://www.transkribus.org

1. Accuracy: A+
2. Tuning: A. Available via the API. URL: https://www.transkribus.org/ai-training
3. Price: B. 0-60 euros per month. URL: https://www.transkribus.org/plans
4. Simplicity: B. Cannot detect the text language. Integration is through the metagrapho API. URL: https://www.transkribus.org/metagrapho

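
Since this option does not report the language, the language-recognition step of the pipeline would need its own detector. Below is a minimal heuristic sketch, assuming only the three target languages: Hebrew is identified by its Unicode script, and English and German are separated by counting hits against small, hypothetical stopword lists. A production system would more likely use a trained language-identification model.

```python
import re

# Small hypothetical stopword lists, chosen to avoid words shared by both languages.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "that", "this", "with", "for", "have"},
    "de": {"der", "das", "und", "ist", "nicht", "ich", "ein", "mit", "zu", "sie"},
}


def detect_language(text: str) -> str:
    """Guess 'he', 'de', 'en', or 'unknown' for a piece of recognized text."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "unknown"

    # Hebrew has its own Unicode block (U+0590 to U+05FF), so script suffices.
    hebrew = sum(1 for ch in letters if "\u0590" <= ch <= "\u05ff")
    if hebrew > len(letters) / 2:
        return "he"

    # English and German share the Latin script, so compare stopword hits.
    words = re.findall(r"[^\W\d_]+", text.lower())
    en_hits = sum(word in STOPWORDS["en"] for word in words)
    de_hits = sum(word in STOPWORDS["de"] for word in words)
    if en_hits == de_hits:
        # No evidence, or a tie: do not guess.
        return "unknown"
    return "en" if en_hits > de_hits else "de"
```

Short or noisy HTR output may contain no stopwords at all, in which case the function returns `"unknown"` and the item can be left for the human-correction step.
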
## (D) ChatGPT

1. Accuracy:
2. Tuning:
3. Price:
4. Simplicity:

## (E) LLaVA

URL: https://llava-vl.github.io

1. Accuracy:
2. Tuning:
3. Price: A+. Free on-device.
4. Simplicity:

## (F) InternVL2-Llama3-76B

URL: https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B

1. Accuracy:
2. Tuning:
3. Price: A+. Free on-device.
4. Simplicity:

## (G) Handwriting OCR

URL: https://www.handwritingocr.com

1. Accuracy:
2. Tuning: F. Not available.
3. Price: C. $0.06-$0.12 per page. URL: https://www.handwritingocr.com/#pricing
4. Simplicity: B. URL: https://www.handwritingocr.com/api/docs
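
To make the per-page price concrete, the quoted range can be multiplied out for a whole collection. This is a rough sketch only: the 10,000-page collection size is a hypothetical figure, and the function name is illustrative. Prices are held in whole cents to avoid floating-point rounding in the multiplication.

```python
from typing import Tuple

# Per-page price range quoted above for Handwriting OCR, in US cents.
MIN_CENTS_PER_PAGE = 6
MAX_CENTS_PER_PAGE = 12


def estimate_cost_usd(pages: int) -> Tuple[float, float]:
    """Return the (low, high) cost in US dollars for the given page count."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    low = pages * MIN_CENTS_PER_PAGE / 100
    high = pages * MAX_CENTS_PER_PAGE / 100
    return low, high


# 10,000 pages is a hypothetical collection size, used only for illustration.
low, high = estimate_cost_usd(10_000)
print(f"${low:,.2f} to ${high:,.2f}")  # prints "$600.00 to $1,200.00"
```
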