Adding some initial thoughts on the design
parent 4985c6456b
commit a2ca9ec8f6
Design.md (10 lines changed)
@@ -8,14 +8,14 @@ Input: raw photos
 Output: List of Artifacts that link the raw photo with a textual representation
 1. Ingest all of the raw data files. Photos of letters, postcards, and photos.
 2. Digitization: conversion of the data to a textual representation.
-3. Normalization: translation of all material to one internal language (e.g. English)
-4. Artifacts: creating Artifacts from joint raw material. Example: photos of a person and the name/time from the photo's backside.
 
-*Data Processing*
+*Data Normalization*
 
 Input: List of Artifacts
 
 Output: Graph of Entities (Person, Location, and Event)
 
-1. Metadata (move to Ingestion?): for each Artifact, extract the Metadata on the Entities that it refers to, such as Person, Event, and Location
+1. Metadata: extract the Metadata on the Entities that each Artifact refers to, such as Person, Event, and Location
-2. Reconciliation: Create and/or Update existing Entities based on the information from the new Artifacts.
+2. Normalization: translation of all material to one internal language (e.g. English)
+3. Artifacts: creating Artifacts from joint raw material. Example: photos of a person and the name/time from the photo's backside.
+4. Reconciliation: Create and/or Update existing Entities based on the information from the new Artifacts.
 
 *Browser*
 
 Input: Graph of Entities
Ingest/Ingest Design.md (new file, 86 lines)
@@ -0,0 +1,86 @@
# Ingest Overview

The purpose of the Ingest system is to digitize the source material.

The Ingestion pipeline performs the following steps for each input file:

1. Extract and normalize text
   a. Identify the region in the input file that contains text
   b. Run Handwritten Text Recognition (HTR) on that region to produce text
   c. Recognize the text language
   d. Provide human corrections and feed them back into the model fine-tuning queue
2. Extract photographs
   a. Identify the region in the input file that contains a photograph
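The steps above can be sketched as a minimal pipeline skeleton. This is illustrative only: the type and function names are not part of the design, and the model callables (HTR, language detection, photo-region detection) are injected as placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class IngestResult:
    """Illustrative result of ingesting one input file."""
    source_file: str
    text: str = ""
    language: str = ""
    photo_regions: list = field(default_factory=list)

def ingest_file(path, htr, detect_language, find_photo_regions):
    """Run the pipeline steps for one file; the model callables are injected."""
    result = IngestResult(source_file=path)
    result.text = htr(path)                          # steps 1a-1b: locate text region, run HTR
    result.language = detect_language(result.text)   # step 1c: language recognition
    result.photo_regions = find_photo_regions(path)  # step 2a: locate photographs
    return result
```

Step 1d (human correction feeding a fine-tuning queue) is deliberately left out here, since it depends on which model/vendor is chosen below.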
# Technologies

## Evaluation criteria

Below is a prioritized list of the evaluation criteria. This is the most important part to align on before choosing the tech stack.

1. Accuracy: how accurately does the model recognize *handwritten* letters in the target languages (English, German, and Hebrew)?
2. Tuning: how easy is it to tune the model based on human feedback?
3. Price: how much does it cost per run/tuning?
4. Simplicity: how much work is it to integrate with the model? For example, does it align with the tech stack in the other systems?

LLM HuggingFace accuracy list: https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
## (A) Amazon Textract

URL: https://aws.amazon.com/pm/textract/

1. Accuracy: ?
2. Tuning: F.
3. Price: B+. ~$15 per million pages. URL: https://aws.amazon.com/textract/pricing/
4. Simplicity: B.
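For a sense of the integration cost, a minimal Textract call via boto3 looks like the sketch below. The parser reflects Textract's documented response shape (a `Blocks` list where line-level items have `BlockType` `"LINE"`); the file name is illustrative, and the API call itself requires AWS credentials.

```python
def extract_lines(response: dict) -> list:
    """Collect line-level text from a Textract DetectDocumentText response."""
    return [b["Text"] for b in response.get("Blocks", [])
            if b.get("BlockType") == "LINE"]

def detect_text(image_path: str) -> list:
    """Call Textract's synchronous text-detection API (needs AWS credentials)."""
    import boto3  # deferred import so the pure parser above works without AWS
    client = boto3.client("textract")
    with open(image_path, "rb") as f:
        response = client.detect_document_text(Document={"Bytes": f.read()})
    return extract_lines(response)
```

Note that `DetectDocumentText` targets printed text and general handwriting; there is no customer fine-tuning path, which is what the Tuning grade above reflects.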
## (B) DocuPanda

URL: https://www.docupanda.io

1. Accuracy:
2. Tuning:
3. Price:
4. Simplicity:
## (C) Transkribus (Proposed)

URL: https://www.transkribus.org

1. Accuracy: A+
2. Tuning: A. Available via the API. URL: https://www.transkribus.org/ai-training
3. Price: B. 0-60 Euros per month. URL: https://www.transkribus.org/plans
4. Simplicity: B. Cannot detect language. Metagrapho API for integration. URL: https://www.transkribus.org/metagrapho
## (D) ChatGPT

1. Accuracy:
2. Tuning:
3. Price:
4. Simplicity:
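For comparison with the dedicated HTR services, transcription through an OpenAI multimodal model is a chat completion whose user message mixes text and an image. The sketch below only builds the request payload; the model name and prompt wording are assumptions, and actually sending the request requires an API key.

```python
import base64

def build_htr_request(image_bytes: bytes, model: str = "gpt-4o") -> dict:
    """Build a chat-completions payload asking the model to transcribe handwriting."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe the handwritten text in this image verbatim."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }
```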
## (E) LLaVA

URL: https://llava-vl.github.io

1. Accuracy:
2. Tuning:
3. Price: A+. Free on-device.
4. Simplicity:
## (F) InternVL2-Llama3-76B

URL: https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B

1. Accuracy:
2. Tuning:
3. Price: A+. Free on-device.
4. Simplicity:
## (G) Handwriting OCR

URL: https://www.handwritingocr.com

1. Accuracy:
2. Tuning: F. Not available.
3. Price: C. $0.06-$0.12 per page. URL: https://www.handwritingocr.com/#pricing
4. Simplicity: B. URL: https://www.handwritingocr.com/api/docs
Process/Process Design.md (new file, 44 lines)
@@ -0,0 +1,44 @@
# Process Overview

The purpose of the Process system is to extract metadata from the ingested material in a format useful for the next stage.

The Process pipeline performs the following steps for each input entity:

1. Normalization: translate all source material into the working language (English? Hebrew?)
2. Metadata: annotate each entity with relevant metadata, such as locations, dates/times, and actors.
3. Reconciliation: map each new entity to existing entities, creating or updating the canonical entity graph.
4. RAG preparation: chunk the data, create embeddings, and store them in a vector database.
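Step 4 above can be sketched with a simple overlapping word-window chunker. The chunk size and overlap are illustrative defaults, and the embedding and vector-store calls are left abstract since no stack has been chosen yet.

```python
def chunk_text(text: str, size: int = 100, overlap: int = 20) -> list:
    """Split text into overlapping word windows, ready for embedding."""
    words = text.split()
    if not words:
        return []
    step = size - overlap  # consecutive chunks share `overlap` words
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]
```

Overlap keeps a sentence that straddles a chunk boundary retrievable from both sides, at the cost of slightly more embeddings to store.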
# Objects

Canonical Entity:
    Type: Person
    Name: [list_of_names]
    References: [entities]

Metadata:
    [Date/time]
    [Location]
    [Person]

Entity:
    Type: Letter, Photograph
    Metadata
    Content: english_text
    Raw Content: original_text
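The object sketches above map naturally onto dataclasses. This is a minimal model under the assumption that names, references, and metadata values are plain lists; the field names follow the sketch, everything else is illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Metadata:
    dates: list = field(default_factory=list)      # [Date/time]
    locations: list = field(default_factory=list)  # [Location]
    persons: list = field(default_factory=list)    # [Person]

@dataclass
class Entity:
    type: str           # "Letter" or "Photograph"
    metadata: Metadata
    content: str        # english_text (normalized)
    raw_content: str    # original_text

@dataclass
class CanonicalEntity:
    type: str                                       # e.g. "Person"
    names: list = field(default_factory=list)       # list_of_names
    references: list = field(default_factory=list)  # entities that mention this one
```

Reconciliation (step 3) then amounts to matching a new `Entity`'s metadata against existing `CanonicalEntity.names` and appending to `references` on a match.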
# Technologies

## Evaluation criteria

Below is a prioritized list of the evaluation criteria. This is the most important part to align on before choosing the tech stack.

Normalization: translate between languages
Metadata: extract insights from text
RAG prep: chunking, embedding, vector-DB
## (A) Llama 3.2

URL: https://www.llama.com

Supports: multilingual translation, metadata extraction, embedding
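With a local Llama model, the metadata step would reduce to a prompt plus JSON parsing. The sketch below covers only prompt construction and parsing of an assumed JSON reply; the prompt wording and key names are assumptions, and the model invocation itself depends on the chosen runtime (e.g. llama.cpp or Ollama).

```python
import json

PROMPT_TEMPLATE = (
    "Extract metadata from the text below. Reply with JSON containing "
    '"locations", "dates", and "persons" as lists of strings.\n\nText:\n{text}'
)

def build_prompt(text: str) -> str:
    """Fill the extraction prompt with the entity's normalized text."""
    return PROMPT_TEMPLATE.format(text=text)

def parse_metadata(reply: str) -> dict:
    """Parse the model's JSON reply, tolerating missing keys."""
    data = json.loads(reply)
    return {k: data.get(k, []) for k in ("locations", "dates", "persons")}
```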