Adding some initial thoughts on the design
parent 4985c6456b
commit a2ca9ec8f6
Design.md (10 lines changed)
@@ -8,14 +8,14 @@ Input: raw photos
 Output: List of Artifacts that link the raw photo with a textual representation
 1. Ingest all of the raw data files. Photos of letters, postcards, and photos.
 2. Digitization: conversion of the data to a textual representation.
-3. Normalization: translation of all material to one internal language (e.g. English)
-4. Artifacts: creating Artifacts from joint raw material. Example: photos of a person and the name/time from the photo's backside.
 
-*Data Processing*
+*Data Normalization*
 
 Input: List of Artifacts
 
 Output: Graph of Entities (Person, Location, and Event)
 
-1. Metadata (move to Ingestion?): for each Artifact, extract the Metadata on the Entities that it refers to, such as Person, Event, and Location
+1. Metadata: extract the Metadata on the Entities that each Artifact refers to, such as Person, Event, and Location
-2. Reconciliation: Create and/or Update existing Entities based on the information from the new Artifacts.
+2. Normalization: translation of all material to one internal language (e.g. English)
+3. Artifacts: creating Artifacts from joint raw material. Example: photos of a person and the name/time from the photo's backside.
+4. Reconciliation: Create and/or Update existing Entities based on the information from the new Artifacts.
 
 *Browser*
 
 Input: Graph of Entities
Ingest/Ingest Design.md (new file, 86 lines)
@@ -0,0 +1,86 @@
# Ingest Overview

The purpose of the Ingest system is to digitize the source material.

The Ingestion pipeline performs the following steps for each input file:

1. Extract and normalize text
   a. Identify the region in the input file that contains text
   b. Run Handwritten Text Recognition (HTR) on that region to produce text
   c. Recognize the text language
   d. Provide human corrections and feed them back into the model fine-tuning queue
2. Extract photographs
   a. Identify the region in the input file that contains a photograph
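The steps above can be sketched as a minimal pipeline skeleton. This is illustrative only: the type and function names are not part of the design, and the model callables (HTR, language detection, photo-region detection) are injected as placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class IngestResult:
    """Illustrative result of ingesting one input file."""
    source_file: str
    text: str = ""
    language: str = ""
    photo_regions: list = field(default_factory=list)

def ingest_file(path, htr, detect_language, find_photo_regions):
    """Run the pipeline steps for one file; the model callables are injected."""
    result = IngestResult(source_file=path)
    result.text = htr(path)                          # steps 1a-1b: locate text region, run HTR
    result.language = detect_language(result.text)   # step 1c: language recognition
    result.photo_regions = find_photo_regions(path)  # step 2a: locate photographs
    return result
```

Step 1d (human correction feeding a fine-tuning queue) is deliberately left out here, since it depends on which model/vendor is chosen below.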
# Technologies

## Evaluation criteria

Below is a prioritized list of the evaluation criteria. This is the most important part to align on before choosing the tech stack.

1. Accuracy: how accurately does the model recognize *handwritten* letters in the target languages (English, German, and Hebrew)?
2. Tuning: how easy is it to tune the model based on human feedback?
3. Price: how much does it cost per run/tuning?
4. Simplicity: how much work is it to integrate with the model? For example, does it align with the tech stack in the other systems?

LLM HuggingFace accuracy list: https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
## (A) Amazon Textract

URL: https://aws.amazon.com/pm/textract/

1. Accuracy: ?
2. Tuning: F.
3. Price: B+. ~$15 per million pages. URL: https://aws.amazon.com/textract/pricing/
4. Simplicity: B.
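For a sense of the integration cost, a minimal Textract call via boto3 looks like the sketch below. The parser reflects Textract's documented response shape (a `Blocks` list where line-level items have `BlockType` `"LINE"`); the file name is illustrative, and the API call itself requires AWS credentials.

```python
def extract_lines(response: dict) -> list:
    """Collect line-level text from a Textract DetectDocumentText response."""
    return [b["Text"] for b in response.get("Blocks", [])
            if b.get("BlockType") == "LINE"]

def detect_text(image_path: str) -> list:
    """Call Textract's synchronous text-detection API (needs AWS credentials)."""
    import boto3  # deferred import so the pure parser above works without AWS
    client = boto3.client("textract")
    with open(image_path, "rb") as f:
        response = client.detect_document_text(Document={"Bytes": f.read()})
    return extract_lines(response)
```

Note that `DetectDocumentText` targets printed text and general handwriting; there is no customer fine-tuning path, which is what the Tuning grade above reflects.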
## (B) DocuPanda

URL: https://www.docupanda.io

1. Accuracy:
2. Tuning:
3. Price:
4. Simplicity:
## (C) Transkribus (Proposed)

URL: https://www.transkribus.org

1. Accuracy: A+
2. Tuning: A. Available via the API. URL: https://www.transkribus.org/ai-training
3. Price: B. 0-60 Euros per month. URL: https://www.transkribus.org/plans
4. Simplicity: B. Cannot detect language. Metagrapho API for integration. URL: https://www.transkribus.org/metagrapho
## (D) ChatGPT

1. Accuracy:
2. Tuning:
3. Price:
4. Simplicity:
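For comparison with the dedicated HTR services, transcription through an OpenAI multimodal model is a chat completion whose user message mixes text and an image. The sketch below only builds the request payload; the model name and prompt wording are assumptions, and actually sending the request requires an API key.

```python
import base64

def build_htr_request(image_bytes: bytes, model: str = "gpt-4o") -> dict:
    """Build a chat-completions payload asking the model to transcribe handwriting."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe the handwritten text in this image verbatim."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }
```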
## (E) LLaVA

URL: https://llava-vl.github.io

1. Accuracy:
2. Tuning:
3. Price: A+. Free on-device.
4. Simplicity:
## (F) InternVL2-Llama3-76B

URL: https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B

1. Accuracy:
2. Tuning:
3. Price: A+. Free on-device.
4. Simplicity:
## (G) Handwriting OCR

URL: https://www.handwritingocr.com

1. Accuracy:
2. Tuning: F. Not available.
3. Price: C. $0.06-$0.12 per page. URL: https://www.handwritingocr.com/#pricing
4. Simplicity: B. URL: https://www.handwritingocr.com/api/docs
Process/Process Design.md (new file, 44 lines)
@@ -0,0 +1,44 @@
# Process Overview

The purpose of the Process system is to extract metadata from the ingested material in a format useful for the next stage.

The Process pipeline performs the following steps for each input entity:

1. Normalization: translate all source material into the working language (English? Hebrew?)
2. Metadata: annotate each entity with relevant metadata, such as locations, dates/times, and actors.
3. Reconciliation: map each new entity to existing entities, creating or updating the canonical entity graph.
4. RAG preparation: chunk the data, create embeddings, and store them in a vector database.
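Step 4 above can be sketched with a simple overlapping word-window chunker. The chunk size and overlap are illustrative defaults, and the embedding and vector-store calls are left abstract since no stack has been chosen yet.

```python
def chunk_text(text: str, size: int = 100, overlap: int = 20) -> list:
    """Split text into overlapping word windows, ready for embedding."""
    words = text.split()
    if not words:
        return []
    step = size - overlap  # consecutive chunks share `overlap` words
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]
```

Overlap keeps a sentence that straddles a chunk boundary retrievable from both sides, at the cost of slightly more embeddings to store.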
# Objects

Canonical Entity:
    Type: Person
    Name: [list_of_names]
    References: [entities]

Metadata:
    [Date/time]
    [Location]
    [Person]

Entity:
    Type: Letter, Photograph
    Metadata
    Content: english_text
    Raw Content: original_text
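The object sketches above map naturally onto dataclasses. This is a minimal model under the assumption that names, references, and metadata values are plain lists; the field names follow the sketch, everything else is illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Metadata:
    dates: list = field(default_factory=list)      # [Date/time]
    locations: list = field(default_factory=list)  # [Location]
    persons: list = field(default_factory=list)    # [Person]

@dataclass
class Entity:
    type: str           # "Letter" or "Photograph"
    metadata: Metadata
    content: str        # english_text (normalized)
    raw_content: str    # original_text

@dataclass
class CanonicalEntity:
    type: str                                       # e.g. "Person"
    names: list = field(default_factory=list)       # list_of_names
    references: list = field(default_factory=list)  # entities that mention this one
```

Reconciliation (step 3) then amounts to matching a new `Entity`'s metadata against existing `CanonicalEntity.names` and appending to `references` on a match.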
# Technologies

## Evaluation criteria

Below is a prioritized list of the evaluation criteria. This is the most important part to align on before choosing the tech stack.

Normalization: translate between languages
Metadata: extract insights from text
RAG prep: chunking, embedding, vector-DB
## (A) Llama 3.2

URL: https://www.llama.com

Supports: multilingual translation, metadata extraction, embedding
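With a local Llama model, the metadata step would reduce to a prompt plus JSON parsing. The sketch below covers only prompt construction and parsing of an assumed JSON reply; the prompt wording and key names are assumptions, and the model invocation itself depends on the chosen runtime (e.g. llama.cpp or Ollama).

```python
import json

PROMPT_TEMPLATE = (
    "Extract metadata from the text below. Reply with JSON containing "
    '"locations", "dates", and "persons" as lists of strings.\n\nText:\n{text}'
)

def build_prompt(text: str) -> str:
    """Fill the extraction prompt with the entity's normalized text."""
    return PROMPT_TEMPLATE.format(text=text)

def parse_metadata(reply: str) -> dict:
    """Parse the model's JSON reply, tolerating missing keys."""
    data = json.loads(reply)
    return {k: data.get(k, []) for k in ("locations", "dates", "persons")}
```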