# Process Overview

The purpose of the Process system is to extract metadata from the ingested material in a format useful for the next stage.

The Process pipeline performs the following steps for each input entity:

1. Normalization: translate all source material into the working language (English or Hebrew; still to be decided).
2. Metadata: annotate each entity with relevant metadata, such as locations, dates/times, and actors.
3. Reconciliation: map each new entity to existing entities, creating or updating the canonical entity graph.
4. RAG preparation: chunk the data, create embeddings, and store them in a vector database.

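The four steps above can be sketched as a small Python skeleton. This is only an illustration under stated assumptions: `Record`, `chunk`, `reconcile`, and `process` are hypothetical names, the translation, extraction, and embedding steps are passed in as plain callables rather than tied to any specific model, the vector database is stood in for by a list, and the name-matching rule in `reconcile` is deliberately naive.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Record:
    """Hypothetical working record passed between the four stages."""
    entity_id: str
    raw_text: str
    text: str = ""
    metadata: Dict[str, List[str]] = field(default_factory=dict)
    canonical_ids: List[str] = field(default_factory=list)


def chunk(text: str, size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def reconcile(record: Record, graph: Dict[str, dict]) -> None:
    """Link each actor to a canonical entry, creating the entry if needed."""
    for name in record.metadata.get("actors", []):
        key = name.strip().lower()  # naive matching key; an assumption
        entry = graph.setdefault(key, {"names": [], "references": []})
        if name not in entry["names"]:
            entry["names"].append(name)
        entry["references"].append(record.entity_id)
        record.canonical_ids.append(key)


def process(record, graph, translate, extract, embed, store):
    """Run one record through all four stages; `store` is a plain list here."""
    record.text = translate(record.raw_text)      # 1. Normalization
    record.metadata = extract(record.text)        # 2. Metadata
    reconcile(record, graph)                      # 3. Reconciliation
    for piece in chunk(record.text):              # 4. RAG preparation
        store.append((record.entity_id, piece, embed(piece)))
    return record
```

Passing the model-backed steps in as callables keeps the skeleton independent of the technology choice discussed later; a real reconciliation step would likely need fuzzier matching than a lowercased name.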
# Objects

Canonical Entity:

- Type: Person
- Name: [list_of_names]
- References: [entities]

Metadata:

- [Date/time]
- [Location]
- [Person]

Entity:

- Type: Letter, Photograph
- Metadata
- Content: english_text
- Raw Content: original_text

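The object outline above could be modeled with Python dataclasses, as in the sketch below. The field names and types are assumptions read off the outline (for example, dates are kept as plain strings and references as a list of `Entity` objects); the source does not specify a schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Metadata:
    dates: List[str] = field(default_factory=list)      # [Date/time]; string format is an assumption
    locations: List[str] = field(default_factory=list)  # [Location]
    people: List[str] = field(default_factory=list)     # [Person]


@dataclass
class Entity:
    type: str           # "Letter" or "Photograph"
    metadata: Metadata
    content: str        # english_text (normalized)
    raw_content: str    # original_text (as ingested)


@dataclass
class CanonicalEntity:
    type: str                                           # "Person"
    names: List[str]                                    # [list_of_names]
    references: List[Entity] = field(default_factory=list)  # [entities]
```

A canonical person would then collect every letter or photograph that mentions them, for example `CanonicalEntity("Person", ["Ada Lovelace"], [letter])`.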
# Technologies

## Evaluation criteria

Below is a prioritized list of the evaluation criteria. Aligning on these is the most important step before choosing the tech stack.

1. Normalization: translate between languages
2. Metadata: extract insights from text
3. RAG prep: chunking, embedding, vector database

## (A) Llama 3.2

URL: https://www.llama.com

Supports: multilingual translation, metadata extraction, embedding