Adding some initial thoughts on the design

Gilad Naor 2024-10-27 09:11:51 -04:00
parent 4985c6456b
commit a2ca9ec8f6
3 changed files with 135 additions and 5 deletions


@ -8,14 +8,14 @@ Input: raw photos
Output: List of Artifacts that link the raw photo with a textual representation
1. Ingest all of the raw data files: photos of letters, postcards, and original photographs.
2. Digitization: conversion of the data to a textual representation.
3. Normalization: translation of all material to one internal language (e.g. English)
4. Artifacts: creating Artifacts by joining related raw material. Example: a photo of a person joined with the name/time written on the photo's back.
*Data Processing*
*Data Normalization*
Input: List of Artifacts
Output: Graph of Entities (Person, Location, and Event)
1. Metadata (move to Ingestion?): for each Artifact, extract the Metadata on the Entities that it refers to, such as Person, Event, and Location
2. Reconciliation: Create and/or Update existing Entities based on the information from the new Artifacts.
1. Metadata: extract the Metadata on the Entities that each Artifact refers to, such as Person, Event, and Location
2. Normalization: translation of all material to one internal language (e.g. English)
3. Artifacts: creating Artifacts by joining related raw material. Example: a photo of a person joined with the name/time written on the photo's back.
4. Reconciliation: Create and/or Update existing Entities based on the information from the new Artifacts.
*Browser*
Input: Graph of Entities

Ingest/Ingest Design.md Normal file

@ -0,0 +1,86 @@
# Ingest Overview
The purpose of the Ingest system is to digitize the source material.
The Ingestion pipeline performs the following steps for each input file (a sketch follows the list):
1. Extract and normalize text
a. Identify the region in the input file with text
b. Run Handwriting Text Recognition (HTR) on the region to produce text
c. Recognize the text language
d. Collect human corrections and feed them back into the model fine-tuning queue
2. Extract photographs
a. Identify the region in the input file with a photograph
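A minimal sketch of how these steps might compose in code. The `region_detector`, `htr`, and `lang_detector` components and the `IngestResult` structure are assumptions for illustration only, not decided interfaces:

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class IngestResult:
    """Digitized output for a single input file."""
    source: Path
    texts: list[str] = field(default_factory=list)        # recognized text per region
    languages: list[str] = field(default_factory=list)    # detected language per region
    photo_regions: list = field(default_factory=list)     # bounding boxes of photographs


def ingest_file(path: Path, region_detector, htr, lang_detector) -> IngestResult:
    """Run the Ingest steps for one input file (sketch only)."""
    result = IngestResult(source=path)

    # 1a. Identify the regions of the file that contain text.
    for region in region_detector.text_regions(path):
        # 1b. Handwriting Text Recognition (HTR) on the region.
        text = htr.recognize(region)
        result.texts.append(text)
        # 1c. Recognize the language of the text.
        result.languages.append(lang_detector.detect(text))

    # 1d. Human correction is collected outside this function; corrected
    #     (image, text) pairs are queued for model fine-tuning.

    # 2a. Identify the regions of the file that contain photographs.
    result.photo_regions = list(region_detector.photo_regions(path))
    return result
```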
# Technologies
## Evaluation criteria
Below is a prioritized list of the evaluation criteria. This is the most important part to align on before choosing the tech stack.
1. Accuracy: how accurate is the model at recognizing *handwritten* letters in the target languages (English, German, and Hebrew).
2. Tuning: how easy is it to tune the model based on human feedback.
3. Price: how much does it cost per run/tuning.
4. Simplicity: how much work is it to integrate with the model? For example, does it align with the tech-stack in the other systems?
Hugging Face open VLM leaderboard (accuracy reference): https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
## (A) Amazon Textract
URL: https://aws.amazon.com/pm/textract/
1. Accuracy: ?
2. Tuning: F.
3. Price: B+. ~$15 per million pages. URL: https://aws.amazon.com/textract/pricing/
4. Simplicity: B. (See the boto3 sketch below.)
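To gauge the integration effort, here is a minimal sketch of calling Textract's synchronous text-detection API with boto3; the file name and region are placeholders, and handwriting accuracy would still need to be evaluated separately:

```python
import boto3

# Assumes AWS credentials are configured in the environment.
textract = boto3.client("textract", region_name="us-east-1")

with open("letter_scan.jpg", "rb") as f:
    response = textract.detect_document_text(Document={"Bytes": f.read()})

# Collect the recognized lines of text.
lines = [block["Text"] for block in response["Blocks"] if block["BlockType"] == "LINE"]
print("\n".join(lines))
```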
## (B) DocuPanda
URL: https://www.docupanda.io
1. Accuracy:
2. Tuning:
3. Price:
4. Simplicity:
## (C) Transkribus (Proposed)
URL: https://www.transkribus.org
1. Accuracy: A+
2. Tuning: A. Available via the API. URL: https://www.transkribus.org/ai-training
3. Price: B. 0-60 Euros per month. URL: https://www.transkribus.org/plans
4. Simplicity: B. Cannot detect language. Metagrapho API for integration. URL: https://www.transkribus.org/metagrapho
## (D) ChatGPT
1. Accuracy:
2. Tuning:
3. Price:
4. Simplicity:
## (E) LLaVA
URL: https://llava-vl.github.io
1. Accuracy:
2. Tuning:
3. Price: A+, Free on-device
4. Simplicity:
## (F) InternVL2-Llama3-76B
URL: https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B
1. Accuracy:
2. Tuning:
3. Price: A+, Free on-device
4. Simplicity:
## (G) Handwriting OCR
URL: https://www.handwritingocr.com
1. Accuracy:
2. Tuning: F, not available
3. Price: C, $0.06-$0.12 per page. URL: https://www.handwritingocr.com/#pricing
4. Simplicity: B, URL: https://www.handwritingocr.com/api/docs

Process/Process Design.md Normal file

@ -0,0 +1,44 @@
# Process Overview
The purpose of the Process system is to extract metadata from the ingested material in a format useful for the next stage.
The Process pipeline performs the following steps for each input entity (a sketch follows the list):
1. Normalization: translate all source material into the working language (English? Hebrew?)
2. Metadata: annotate each entity with relevant metadata, such as locations, dates/times, and actors.
3. Reconciliation: map each new entity to existing entities, creating or updating the canonical entity graph.
4. RAG preparation: chunk the data, create embeddings, and store them in a vector database
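A minimal sketch of how these four steps might compose. The `translator`, `annotator`, `reconciler`, `embedder`, and `vector_db` components, the `ProcessedEntity` structure, and the fixed-size chunking are assumptions for illustration only:

```python
from dataclasses import dataclass


@dataclass
class ProcessedEntity:
    entity_id: str
    content_english: str            # normalized (translated) text
    metadata: dict                  # locations, dates/times, actors
    chunks: list[str]               # RAG chunks
    embeddings: list[list[float]]   # one embedding vector per chunk


def process_entity(entity, translator, annotator, reconciler, embedder, vector_db) -> ProcessedEntity:
    """Run the Process steps for one ingested entity (sketch only)."""
    # 1. Normalization: translate the raw content into the working language.
    english = translator.to_english(entity.raw_content)

    # 2. Metadata: annotate with locations, dates/times, and actors.
    metadata = annotator.extract(english)

    # 3. Reconciliation: link the entity into the canonical entity graph.
    reconciler.link(entity.entity_id, metadata)

    # 4. RAG preparation: chunk, embed, and store in the vector database.
    #    Fixed-size chunking is a placeholder for a real chunking strategy.
    chunks = [english[i:i + 1000] for i in range(0, len(english), 1000)]
    embeddings = [embedder.embed(chunk) for chunk in chunks]
    vector_db.upsert(entity.entity_id, chunks, embeddings)

    return ProcessedEntity(entity.entity_id, english, metadata, chunks, embeddings)
```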
# Objects
Canonical Entity:
    Type: Person
    Name: [list_of_names]
    References: [entities]
    Metadata:
        [Date/time]
        [Location]
        [Person]

Entity:
    Type: Letter, Photograph
    Metadata
    Content: english_text
    Raw Content: original_text
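One possible in-memory representation of these objects as Python dataclasses; the field names follow the sketch above and are not final:

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Metadata:
    dates: list[datetime] = field(default_factory=list)
    locations: list[str] = field(default_factory=list)
    persons: list[str] = field(default_factory=list)


@dataclass
class Entity:
    """A single ingested artifact, e.g. a letter or a photograph."""
    entity_type: str                          # "Letter" or "Photograph"
    content: str                              # normalized (English) text
    raw_content: str                          # original-language text
    metadata: Metadata = field(default_factory=Metadata)


@dataclass
class CanonicalEntity:
    """A reconciled node in the entity graph, e.g. a Person."""
    entity_type: str                          # e.g. "Person"
    names: list[str] = field(default_factory=list)
    references: list[Entity] = field(default_factory=list)
    metadata: Metadata = field(default_factory=Metadata)
```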
# Technologies
## Evaluation criteria
Below is a prioritized list of the evaluation criteria. This is the most important part to align on before choosing the tech stack.
1. Normalization: translate between languages
2. Metadata: extract insights from text
3. RAG prep: chunking, embedding, vector DB
## (A) Llama 3.2
URL: https://www.llama.com
Supports: multilingual translation, metadata extraction, embedding
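If Llama 3.2 is chosen, metadata extraction could be prompted directly. A rough sketch using the Hugging Face transformers text-generation pipeline; the model ID, prompt wording, and sample letter are illustrative, and access to the gated meta-llama checkpoint is assumed:

```python
from transformers import pipeline

# Assumes access to the gated meta-llama checkpoint and sufficient GPU memory.
generator = pipeline("text-generation", model="meta-llama/Llama-3.2-3B-Instruct")

letter_text = "Dear Hannah, we arrived in Haifa on the 3rd of March..."  # illustrative input

messages = [
    {"role": "system", "content": "Extract metadata from the letter as JSON with keys: persons, locations, dates."},
    {"role": "user", "content": letter_text},
]

result = generator(messages, max_new_tokens=200)
# With chat-style input, generated_text holds the conversation; the last message is the model's reply.
print(result[0]["generated_text"][-1]["content"])
```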