Adding some initial thoughts on the design

Gilad Naor 2024-10-27 09:11:51 -04:00
parent 4985c6456b
commit a2ca9ec8f6
3 changed files with 135 additions and 5 deletions


@ -8,14 +8,14 @@ Input: raw photos
Output: List of Artifacts that link the raw photo with a textual representation
1. Ingest all of the raw data files: photos of letters, postcards, and original photographs.
2. Digitization: conversion of the data to a textual representation.
3. Normalization: translation of all material to one internal language (e.g. English)
4. Artifacts: creating Artifacts by joining related raw material. Example: a photo of a person joined with the name/time written on the photo's back.
*Data Processing*
*Data Normalization*
Input: List of Artifacts
Output: Graph of Entities (Person, Location, and Event)
1. Metadata (move to Ingestion?): for each Artifact, extract the Metadata on the Entities that it refers to, such as Person, Event, and Location
2. Reconciliation: Create and/or Update existing Entities based on the information from the new Artifacts.
1. Metadata: extract the Metadata on the Entities that each Artifact refers to, such as Person, Event, and Location
2. Normalization: translation of all material to one internal language (e.g. English)
3. Artifacts: creating Artifacts by joining related raw material. Example: a photo of a person joined with the name/time written on the photo's back.
4. Reconciliation: Create and/or Update existing Entities based on the information from the new Artifacts.
*Browser*
Input: Graph of Entities

Ingest/Ingest Design.md Normal file

@ -0,0 +1,86 @@
# Ingest Overview
The purpose of the Ingest system is to digitize the source material.
The Ingestion pipeline performs the following steps for each input file (a sketch follows the list):
1. Extract and normalize text
a. Identify the region in the input file with text
b. Run Handwriting Text Recognition (HTR) on the region to produce text
c. Recognize the text language
d. Collect human corrections and feed them back into the model fine-tuning queue
2. Extract photographs
a. Identify the region in the input file with a photograph
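A minimal sketch of how these steps might compose in code. The `region_detector`, `htr`, and `lang_detector` components and the `IngestResult` structure are assumptions for illustration only, not decided interfaces:

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class IngestResult:
    """Digitized output for a single input file."""
    source: Path
    texts: list[str] = field(default_factory=list)        # recognized text per region
    languages: list[str] = field(default_factory=list)    # detected language per region
    photo_regions: list = field(default_factory=list)     # bounding boxes of photographs


def ingest_file(path: Path, region_detector, htr, lang_detector) -> IngestResult:
    """Run the Ingest steps for one input file (sketch only)."""
    result = IngestResult(source=path)

    # 1a. Identify the regions of the file that contain text.
    for region in region_detector.text_regions(path):
        # 1b. Handwriting Text Recognition (HTR) on the region.
        text = htr.recognize(region)
        result.texts.append(text)
        # 1c. Recognize the language of the text.
        result.languages.append(lang_detector.detect(text))

    # 1d. Human correction is collected outside this function; corrected
    #     (image, text) pairs are queued for model fine-tuning.

    # 2a. Identify the regions of the file that contain photographs.
    result.photo_regions = list(region_detector.photo_regions(path))
    return result
```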
# Technologies
## Evaluation criteria
Below is a prioritized list of the evaluation criteria. This is the most important part to align on before choosing the tech stack.
1. Accuracy: how accurate is the model at recognizing *handwritten* letters in the target languages (English, German, and Hebrew).
2. Tuning: how easy is it to tune the model based on human feedback.
3. Price: how much does it cost per run/tuning.
4. Simplicity: how much work is it to integrate with the model? For example, does it align with the tech-stack in the other systems?
Hugging Face open VLM leaderboard (accuracy reference): https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
## (A) Amazon Textract
URL: https://aws.amazon.com/pm/textract/
1. Accuracy: ?
2. Tuning: F.
3. Price: B+. ~$15 per million pages. URL: https://aws.amazon.com/textract/pricing/
4. Simplicity: B. (See the boto3 sketch below.)
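To gauge the integration effort, here is a minimal sketch of calling Textract's synchronous text-detection API with boto3; the file name and region are placeholders, and handwriting accuracy would still need to be evaluated separately:

```python
import boto3

# Assumes AWS credentials are configured in the environment.
textract = boto3.client("textract", region_name="us-east-1")

with open("letter_scan.jpg", "rb") as f:
    response = textract.detect_document_text(Document={"Bytes": f.read()})

# Collect the recognized lines of text.
lines = [block["Text"] for block in response["Blocks"] if block["BlockType"] == "LINE"]
print("\n".join(lines))
```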
## (B) DocuPanda
URL: https://www.docupanda.io
1. Accuracy:
2. Tuning:
3. Price:
4. Simplicity:
## (C) Transkribus (Proposed)
URL: https://www.transkribus.org
1. Accuracy: A+
2. Tuning: A. Available via the API. URL: https://www.transkribus.org/ai-training
3. Price: B. 0-60 Euros per month. URL: https://www.transkribus.org/plans
4. Simplicity: B. Cannot detect language. Metagrapho API for integration. URL: https://www.transkribus.org/metagrapho
## (D) ChatGPT
1. Accuracy:
2. Tuning:
3. Price:
4. Simplicity:
## (E) LLaVA
URL: https://llava-vl.github.io
1. Accuracy:
2. Tuning:
3. Price: A+, Free on-device
4. Simplicity:
## (F) InternVL2-Llama3-76B
URL: https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B
1. Accuracy:
2. Tuning:
3. Price: A+, Free on-device
4. Simplicity:
## (G) Handwriting OCR
URL: https://www.handwritingocr.com
1. Accuracy:
2. Tuning: F, not available
3. Price: C, $0.06-$0.12 per page. URL: https://www.handwritingocr.com/#pricing
4. Simplicity: B, URL: https://www.handwritingocr.com/api/docs

Process/Process Design.md Normal file

@ -0,0 +1,44 @@
# Process Overview
The purpose of the Process system is to extract metadata from the ingested material in a format useful for the next stage.
The Process pipeline performs the following steps for each input entity (a sketch follows the list):
1. Normalization: translate all source material into the working language (English? Hebrew?)
2. Metadata: annotate each entity with relevant metadata, such as locations, dates/times, and actors.
3. Reconciliation: map each new entity to existing entities, creating or updating the canonical entity graph.
4. RAG preparation: chunk the data, create embeddings, and store them in a vector database
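A minimal sketch of how these four steps might compose. The `translator`, `annotator`, `reconciler`, `embedder`, and `vector_db` components, the `ProcessedEntity` structure, and the fixed-size chunking are assumptions for illustration only:

```python
from dataclasses import dataclass


@dataclass
class ProcessedEntity:
    entity_id: str
    content_english: str            # normalized (translated) text
    metadata: dict                  # locations, dates/times, actors
    chunks: list[str]               # RAG chunks
    embeddings: list[list[float]]   # one embedding vector per chunk


def process_entity(entity, translator, annotator, reconciler, embedder, vector_db) -> ProcessedEntity:
    """Run the Process steps for one ingested entity (sketch only)."""
    # 1. Normalization: translate the raw content into the working language.
    english = translator.to_english(entity.raw_content)

    # 2. Metadata: annotate with locations, dates/times, and actors.
    metadata = annotator.extract(english)

    # 3. Reconciliation: link the entity into the canonical entity graph.
    reconciler.link(entity.entity_id, metadata)

    # 4. RAG preparation: chunk, embed, and store in the vector database.
    #    Fixed-size chunking is a placeholder for a real chunking strategy.
    chunks = [english[i:i + 1000] for i in range(0, len(english), 1000)]
    embeddings = [embedder.embed(chunk) for chunk in chunks]
    vector_db.upsert(entity.entity_id, chunks, embeddings)

    return ProcessedEntity(entity.entity_id, english, metadata, chunks, embeddings)
```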
# Objects
Canonical Entity:
    Type: Person
    Name: [list_of_names]
    References: [entities]
    Metadata:
        [Date/time]
        [Location]
        [Person]

Entity:
    Type: Letter, Photograph
    Metadata
    Content: english_text
    Raw Content: original_text
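One possible in-memory representation of these objects as Python dataclasses; the field names follow the sketch above and are not final:

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Metadata:
    dates: list[datetime] = field(default_factory=list)
    locations: list[str] = field(default_factory=list)
    persons: list[str] = field(default_factory=list)


@dataclass
class Entity:
    """A single ingested artifact, e.g. a letter or a photograph."""
    entity_type: str                          # "Letter" or "Photograph"
    content: str                              # normalized (English) text
    raw_content: str                          # original-language text
    metadata: Metadata = field(default_factory=Metadata)


@dataclass
class CanonicalEntity:
    """A reconciled node in the entity graph, e.g. a Person."""
    entity_type: str                          # e.g. "Person"
    names: list[str] = field(default_factory=list)
    references: list[Entity] = field(default_factory=list)
    metadata: Metadata = field(default_factory=Metadata)
```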
# Technologies
## Evaluation criteria
Below is a prioritized list of the evaluation criteria. This is the most important part to align on before choosing the tech stack.
1. Normalization: translate between languages
2. Metadata: extract insights from text
3. RAG prep: chunking, embedding, vector DB
## (A) Llama 3.2
URL: https://www.llama.com
Supports: multilingual translation, metadata extraction, embedding
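If Llama 3.2 is chosen, metadata extraction could be prompted directly. A rough sketch using the Hugging Face transformers text-generation pipeline; the model ID, prompt wording, and sample letter are illustrative, and access to the gated meta-llama checkpoint is assumed:

```python
from transformers import pipeline

# Assumes access to the gated meta-llama checkpoint and sufficient GPU memory.
generator = pipeline("text-generation", model="meta-llama/Llama-3.2-3B-Instruct")

letter_text = "Dear Hannah, we arrived in Haifa on the 3rd of March..."  # illustrative input

messages = [
    {"role": "system", "content": "Extract metadata from the letter as JSON with keys: persons, locations, dates."},
    {"role": "user", "content": letter_text},
]

result = generator(messages, max_new_tokens=200)
# With chat-style input, generated_text holds the conversation; the last message is the model's reply.
print(result[0]["generated_text"][-1]["content"])
```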