From a2ca9ec8f60f1d02b14d31810c39d2d56e6fcb10 Mon Sep 17 00:00:00 2001
From: Gilad Naor
Date: Sun, 27 Oct 2024 09:11:51 -0400
Subject: [PATCH] Adding some initial thoughts on the design

---
 Design.md                 | 10 ++---
 Ingest/Ingest Design.md   | 86 +++++++++++++++++++++++++++++++++++++++
 Process/Process Design.md | 44 ++++++++++++++++++++
 3 files changed, 135 insertions(+), 5 deletions(-)
 create mode 100644 Ingest/Ingest Design.md
 create mode 100644 Process/Process Design.md

diff --git a/Design.md b/Design.md
index b73a13f..687f4a2 100644
--- a/Design.md
+++ b/Design.md
@@ -8,14 +8,14 @@ Input: raw photos
 Output: List of Artifacts that link the raw photo with a textual representation
 1. Ingest all of the raw data files. Photos of letters, postcards, and photos.
 2. Digitization: conversation of the data to a textual representation.
-3. Normalization: translation of all material to one internal language (e.g. English)
-4. Artifacts: creating Artifacts from joint raw material. Example: photos of person and the name/time from the photo's backside.
 
-*Data Processing*
+*Data Normalization*
 Input: List of Artifacts
 Output: Graph of Entities (Person, Location, and Event)
-1. Metadata (move to Ingestino?): for each Artifact, extract the Metadata on the Entities that it refers to, such as Person, Event, and Location
-2. Reconciliation: Create and/or Update existing Entities based on the information from the new Artifacts.
+1. Metadata: extract the Metadata on the Entities that each Artifact refers to, such as Person, Event, and Location
+2. Normalization: translation of all material to one internal language (e.g. English)
+3. Artifacts: creating Artifacts from joint raw material. Example: photos of person and the name/time from the photo's backside.
+4. Reconciliation: Create and/or Update existing Entities based on the information from the new Artifacts.
 
 *Browser*
 Input: Graph of Entities
diff --git a/Ingest/Ingest Design.md b/Ingest/Ingest Design.md
new file mode 100644
index 0000000..ef455e6
--- /dev/null
+++ b/Ingest/Ingest Design.md
@@ -0,0 +1,86 @@
+# Ingest Overview
+The purpose of the Ingest system is to digitize the source material.
+
+The Ingestion pipeline performs the following steps for each input file:
+1. Extract and normalize text
+   a. Identify the region in the input file with text
+   b. Run Handwriting Text Recognition (HTR) to convert the region to text
+   c. Recognize the text language
+   d. Provide human correction and feed back into model fine-tuning queue
+2. Extract photographs
+   a. Identify the region in the input file with a photograph
+
+
+# Technologies
+
+## Evaluation criteria
+Below is a prioritized list of the evaluation criteria. This is the most important part to align on before choosing the right tech stack.
+
+1. Accuracy: how accurate is the model at recognizing *handwritten* letters in the target languages (English, German, and Hebrew).
+2. Tuning: how easy is it to tune the model based on human feedback.
+3. Price: how much does it cost per run/tuning.
+4. Simplicity: how much work is it to integrate with the model? For example, does it align with the tech-stack in the other systems?
+
+LLM HuggingFace accuracy list: https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
+
+## (A) Amazon Textract
+
+URL: https://aws.amazon.com/pm/textract/
+
+1. Accuracy: ?
+2. Tuning: F.
+3. Price: B+. ~$15 per million pages. URL: https://aws.amazon.com/textract/pricing/
+4. Simplicity: B.
+
+## (B) DocuPanda
+
+URL: https://www.docupanda.io
+
+1. Accuracy:
+2. Tuning:
+3. Price:
+4. Simplicity:
+
+## (C) Transkribus (Proposed)
+
+URL: https://www.transkribus.org
+
+1. Accuracy: A+
+2. Tuning: A. Available via the API. URL: https://www.transkribus.org/ai-training
+3. Price: B. 0-60 Euros per month. URL: https://www.transkribus.org/plans
+4. Simplicity: B. Cannot detect language. metagrapho API for integration. URL: https://www.transkribus.org/metagrapho
+
+## (D) ChatGPT
+
+1. Accuracy:
+2. Tuning:
+3. Price:
+4. Simplicity:
+
+## (E) LLaVA
+
+URL: https://llava-vl.github.io
+
+1. Accuracy:
+2. Tuning:
+3. Price: A+, Free on-device
+4. Simplicity:
+
+## (F) InternVL2-Llama3-76B
+
+URL: https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B
+
+1. Accuracy:
+2. Tuning:
+3. Price: A+, Free on-device
+4. Simplicity:
+
+## (G) Handwriting OCR
+
+URL: https://www.handwritingocr.com
+
+
+1. Accuracy:
+2. Tuning: F, not available
+3. Price: C, $0.06-$0.12 per page. URL: https://www.handwritingocr.com/#pricing
+4. Simplicity: B, URL: https://www.handwritingocr.com/api/docs
\ No newline at end of file
diff --git a/Process/Process Design.md b/Process/Process Design.md
new file mode 100644
index 0000000..1020760
--- /dev/null
+++ b/Process/Process Design.md
@@ -0,0 +1,44 @@
+# Process Overview
+The purpose of the Process system is to extract metadata from the ingested material in a format useful for the next stage.
+
+The Process pipeline performs the following steps for each input entity:
+1. Normalization: translate all source material into the working language (English? Hebrew?)
+2. Metadata: annotate each entity with relevant metadata, such as locations, dates/times, and actors.
+3. Reconciliation: map new entity to existing entities, creating or updating the canonical entity graph.
+4. RAG preparation: chunk the data, create embeddings, store in a vector database
+
+# Objects
+
+Canonical Entity:
+  Type: Person
+  Name: [list_of_names]
+  References: [entities]
+
+Metadata:
+  [Date/time]
+  [Location]
+  [Person]
+
+Entity:
+  Type: Letter, Photograph
+  Metadata
+  Content: english_text
+  Raw Content: original_text
+
+# Technologies
+
+## Evaluation criteria
+Below is a prioritized list of the evaluation criteria. This is the most important part to align on before choosing the right tech stack.
+
+Normalization: translate between languages
+Metadata: extract insights from text
+RAG prep: chunking, embedding, vector-DB
+
+## (A) Llama 3.2
+
+URL: https://www.llama.com
+
+Supports: multilingual translation, metadata extraction, embedding
+
+
+
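Review note on the Objects section of Process/Process Design.md: the three objects and the Reconciliation step could be modeled roughly as below. This is a minimal Python sketch, not part of the patch. The class names, field names, and the `reconcile` helper are assumptions made for illustration, and exact name matching stands in for whatever reconciliation logic the design eventually chooses.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class EntityType(Enum):
    """Entity types listed in the design: Letter, Photograph."""
    LETTER = "Letter"
    PHOTOGRAPH = "Photograph"


@dataclass
class Metadata:
    """Dates/times, locations, and persons that an entity refers to."""
    dates: list[str] = field(default_factory=list)
    locations: list[str] = field(default_factory=list)
    persons: list[str] = field(default_factory=list)


@dataclass
class Entity:
    """One ingested item with normalized and original text."""
    type: EntityType
    metadata: Metadata
    content: str      # english_text in the design
    raw_content: str  # original_text in the design


@dataclass
class CanonicalEntity:
    """A node in the canonical graph; the design lists only type Person."""
    type: str
    names: list[str] = field(default_factory=list)
    references: list[Entity] = field(default_factory=list)


def reconcile(graph: list[CanonicalEntity], entity: Entity) -> None:
    """Hypothetical reconciliation: attach the entity to each person it
    mentions, creating a canonical Person when no name matches exactly."""
    for name in entity.metadata.persons:
        match = next((c for c in graph if name in c.names), None)
        if match is None:
            match = CanonicalEntity(type="Person", names=[name])
            graph.append(match)
        match.references.append(entity)


if __name__ == "__main__":
    graph: list[CanonicalEntity] = []
    letter = Entity(EntityType.LETTER, Metadata(persons=["Anna"]),
                    content="Dear Anna", raw_content="Liebe Anna")
    reconcile(graph, letter)
    print([(c.names, len(c.references)) for c in graph])
```

Keeping `content` and `raw_content` side by side mirrors the design's split between normalized and original text, so a later translation change does not lose the source wording.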