File And Document Artifacts

External files are modeled internally as document artifacts. In practice they are user-uploaded files, imported source material, datasets, exports, notes, or references that can create a project and then remain attached to it. This keeps upload, extraction, tagging, cross-linking, and review outputs near the user-defined project entity that gives them meaning.

Document artifact storage follows the same account and project model in every runtime. Localhost writes file bytes into .wm-data/objects/; AWS writes the same provider-neutral keys into S3; Azure writes the same provider-neutral keys into Blob Storage. The storage service changes, but the application account, project, document id, and permission rules do not.
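
A minimal sketch of how one provider-neutral key could resolve to a location in each runtime. Only the .wm-data/objects/ directory comes from the description above; the provider identifiers, the resolve_location name, the default container name, and the display URI forms are illustrative assumptions.

```python
from pathlib import PurePosixPath

# Hypothetical provider identifiers; the document names the three runtimes
# but not the literal values stored for them.
LOCAL, AWS, AZURE = "local", "aws_s3", "azure_blob"


def resolve_location(provider: str, storage_key: str,
                     container: str = "wm-documents") -> str:
    """Return a backend-specific display location for a provider-neutral key.

    `container` stands in for an S3 bucket or Azure container name and is an
    assumption, as are the URI prefixes returned below.
    """
    parts = PurePosixPath(storage_key).parts
    if storage_key.startswith("/") or ".." in parts:
        raise ValueError("storage_key must be a relative, normalized key")
    if provider == LOCAL:
        return str(PurePosixPath(".wm-data/objects") / storage_key)
    if provider == AWS:
        return f"s3://{container}/{storage_key}"
    if provider == AZURE:
        return f"azure://{container}/{storage_key}"
    raise ValueError(f"unknown storage provider: {provider}")
```

The key itself never changes between backends; only the wrapper around it does, which is what keeps account, project, and document ids portable.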

The file experience has two surfaces:

A project-scoped file manager for editing files in the active project context.
An independent file repository at /file-repo for tracking document links across all visible projects, teams, accounts, and private workspaces.

The independent repository does not remove project ownership. It adds a cross-workspace index so users can inspect how external files link to project scopes, source-code pages, requirements, metadata records, and write events without changing the project-centered permission model.

By-Account Storage Model

When a project belongs to an application account, raw and processed file artifacts are stored under that account's project prefix:

accounts/{account_id}/projects/{project_id}/documents/{document_id}/raw/{filename}
accounts/{account_id}/projects/{project_id}/documents/{document_id}/processed/{artifact_name}.json

When a project is private to a local or personal workspace, files are stored under a workspace prefix:

workspaces/{workspace_id}/projects/{project_id}/documents/{document_id}/raw/{filename}
workspaces/{workspace_id}/projects/{project_id}/documents/{document_id}/processed/{artifact_name}.json

The account id in the path is an application account id, not an AWS account id or Azure subscription id. It separates client, business, department, firm, or personal workspace data inside the application model. Production deployments can still choose stronger physical isolation, but the baseline model keeps account separation portable across local, AWS, and Azure backends.
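
The two prefixes above can be generated by one helper. This is a sketch under the assumption that a document belongs to exactly one of an application account or a private workspace; the function names are hypothetical.

```python
from typing import Optional


def document_prefix(project_id: str, document_id: str,
                    account_id: Optional[str] = None,
                    workspace_id: Optional[str] = None) -> str:
    """Build the ownership prefix shared by raw and processed artifacts."""
    # Assumption: exactly one owner root applies to any given document.
    if (account_id is None) == (workspace_id is None):
        raise ValueError("provide exactly one of account_id or workspace_id")
    if account_id is not None:
        root = f"accounts/{account_id}"
    else:
        root = f"workspaces/{workspace_id}"
    return f"{root}/projects/{project_id}/documents/{document_id}"


def raw_key(prefix: str, filename: str) -> str:
    """Key for the uploaded file bytes."""
    return f"{prefix}/raw/{filename}"


def processed_key(prefix: str, artifact_name: str) -> str:
    """Key for a derived JSON artifact such as chunks or tables."""
    return f"{prefix}/processed/{artifact_name}.json"
```

For example, raw_key(document_prefix("p1", "d1", account_id="a1"), "report.pdf") yields accounts/a1/projects/p1/documents/d1/raw/report.pdf.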

Artifact Shape

Each artifact record should include:

artifact_id
project_id
account_id
team_id
owner_type
owner_id
storage_provider
storage_key
classification
created_by
created_at
title
document_type
status
source_date
issuing_body
geographic_scope
project_role
data_object_type
metadata_confidence
extraction_summary
concept_tags
standard_relevance
project_data_elements
metadata_record_ids
write_event_ids
related_object_ids

storage_provider records the selected backend, while storage_key remains the provider-neutral object/blob path under the project ownership prefix. Project ownership should be the default because it gives a single place to resolve grants, activity, and billing. Account, team, and user-workspace ownership remain useful for templates, reference libraries, and reusable data objects, but a working document should normally resolve through its assigned project.
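
One possible in-memory shape for the artifact record, sketched as a dataclass. The field names follow the list above plus storage_provider; every type, every default, and the owner_type value "project" are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DocumentArtifact:
    # Identity, ownership, and storage
    artifact_id: str
    project_id: str
    owner_type: str          # assumed values: "project", "account", "team", "user_workspace"
    owner_id: str
    storage_provider: str
    storage_key: str         # provider-neutral path under the ownership prefix
    created_by: str
    created_at: str          # assumed to be an ISO 8601 UTC string
    account_id: Optional[str] = None
    team_id: Optional[str] = None
    # Editable descriptive metadata
    title: str = ""
    document_type: str = ""
    status: str = ""
    classification: str = ""
    source_date: Optional[str] = None
    issuing_body: Optional[str] = None
    geographic_scope: Optional[str] = None
    project_role: Optional[str] = None
    data_object_type: Optional[str] = None
    metadata_confidence: Optional[float] = None
    extraction_summary: str = ""
    # Tags and cross-links
    concept_tags: list = field(default_factory=list)
    standard_relevance: list = field(default_factory=list)
    project_data_elements: list = field(default_factory=list)
    metadata_record_ids: list = field(default_factory=list)
    write_event_ids: list = field(default_factory=list)
    related_object_ids: list = field(default_factory=list)

    def resolves_through_project(self) -> bool:
        """True when the artifact uses the default project ownership."""
        return self.owner_type == "project" and self.owner_id == self.project_id
```

A working document would normally return True from resolves_through_project, while templates or reference libraries owned by an account or team would not.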

Tagging And Cross-Linking

The first tagging layer should support:

concept tags such as architecture, runbook, dataset, release gate, source page, or customer import
document-scoped requirement or standard relevance records
related file/document links
related source-code page, team, account, or project ids
source type and document type
classification and retention category

Standard relevance is separate from project_data_elements. Project-level elements describe extracted requirements, standards, source pages, code pages, dependencies, stakeholders, assets, risks, references, and other project-specific concepts. Standard relevance records mark whether an external file is relevant to a versioned requirement framework, external policy, customer standard, source-code area, or domain-specific standard.

Each standard_relevance item should include standard_id, framework_id, organization, framework_version, standard_code, standard_label, crosswalk_group, relevance_type, confidence, and source. The initial prototype includes a domain-specific standards catalog, but the field is intentionally generic: new catalogs can describe internal engineering standards, release gates, customer requirements, compliance frameworks, operational procedures, or other versioned source-of-truth systems.

Recommended relevance_type values are:

direct: the document explicitly refers to that framework, requirement, source page, or standard.
version_comparison: the record is the comparable requirement from another version of the same framework.
cross_framework_comparison: the record is the comparable requirement from another framework or source catalog.
candidate: the document type or content suggests possible relevance, but no explicit framework was detected.
user_tag: a reviewer manually tagged the document to a standard.
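
The relevance_type values and the standard_relevance fields listed above could be modeled as follows. This is a sketch: the class names are hypothetical, and treating confidence as a float between 0 and 1 is an assumption the document does not state.

```python
from dataclasses import dataclass
from enum import Enum


class RelevanceType(str, Enum):
    DIRECT = "direct"
    VERSION_COMPARISON = "version_comparison"
    CROSS_FRAMEWORK_COMPARISON = "cross_framework_comparison"
    CANDIDATE = "candidate"
    USER_TAG = "user_tag"


@dataclass(frozen=True)
class StandardRelevance:
    standard_id: str
    framework_id: str
    organization: str
    framework_version: str
    standard_code: str
    standard_label: str
    crosswalk_group: str
    relevance_type: RelevanceType
    confidence: float   # assumed range: 0.0 to 1.0
    source: str

    def __post_init__(self) -> None:
        # Accept plain strings from JSON and reject unknown values early.
        object.__setattr__(self, "relevance_type",
                           RelevanceType(self.relevance_type))
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")
```

Coercing the string through the enum means a typo such as "user-tag" fails at record creation rather than surfacing later in review.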

Derived extraction outputs should be artifacts too. For example, chunks.json, tables.json, figures.json, and document_metadata.json should retain a parent link to the raw uploaded document.
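
A sketch of how derived outputs might keep their parent link. The parent_artifact_id field name, the id format, and the derive_artifacts helper are hypothetical; the document only says that a parent link should be retained.

```python
import uuid

# Derived outputs named in the text, without the .json suffix.
DERIVED_OUTPUTS = ("chunks", "tables", "figures", "document_metadata")


def derive_artifacts(raw_artifact: dict) -> list:
    """Create one artifact record per derived output of a raw document."""
    raw_key = raw_artifact["storage_key"]
    prefix, sep, _filename = raw_key.rpartition("/raw/")
    if not sep:
        raise ValueError("expected a raw document key containing '/raw/'")
    return [
        {
            "artifact_id": f"art_{uuid.uuid4().hex}",
            "project_id": raw_artifact["project_id"],
            # Hypothetical field carrying the link back to the raw upload.
            "parent_artifact_id": raw_artifact["artifact_id"],
            "storage_key": f"{prefix}/processed/{name}.json",
        }
        for name in DERIVED_OUTPUTS
    ]
```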

Project Creation Flow

User opens the project intake modal.
User selects the external files that define the project; the raw bytes are sent to the object store in a later step.
Client or API extracts initial metadata from filenames, MIME type, file attributes, and later document contents.
User reviews and overrides file title, type, source date, classification, project role, tags, related source/code references, and data-model elements.
User creates the project workspace from the reviewed file assemblage.
API creates the project, owner grant, artifact records, and presigned upload URLs (shared access signature URLs on Azure Blob Storage).
Client uploads raw files directly to the selected object store, such as S3 or Azure Blob Storage.
Extraction workers create processed artifacts and update metadata confidence, concept tags, and project data elements.
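
The API step in the flow above, which creates the project, owner grant, artifact records, and upload URLs, might look roughly like this. It is an in-memory sketch: the function name, id formats, the "pending_upload" status, and the sign_upload callback are all hypothetical, and a real service would call the storage provider's signing API instead.

```python
import uuid
from typing import Callable


def create_project_from_files(account_id: str, owner_user_id: str,
                              reviewed_files: list,
                              sign_upload: Callable[[str], str]) -> dict:
    """Build the records the API would persist for a new project."""
    project_id = f"proj_{uuid.uuid4().hex[:12]}"
    owner_grant = {"project_id": project_id, "user_id": owner_user_id,
                   "role": "owner"}
    artifacts, uploads = [], []
    for reviewed in reviewed_files:
        document_id = f"doc_{uuid.uuid4().hex[:12]}"
        key = (f"accounts/{account_id}/projects/{project_id}"
               f"/documents/{document_id}/raw/{reviewed['filename']}")
        artifact = {
            "artifact_id": document_id,
            "project_id": project_id,
            "account_id": account_id,
            "owner_type": "project",
            "owner_id": project_id,
            "storage_key": key,
            "title": reviewed.get("title", reviewed["filename"]),
            "status": "pending_upload",  # assumed status value
            "created_by": owner_user_id,
        }
        artifacts.append(artifact)
        # The client uploads bytes straight to the object store with this URL.
        uploads.append({"artifact_id": document_id,
                        "upload_url": sign_upload(key)})
    return {"project_id": project_id, "owner_grant": owner_grant,
            "artifacts": artifacts, "uploads": uploads}
```

Passing the signer in as a callback keeps the project and artifact logic identical across local, AWS, and Azure backends.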

Metadata Extraction Fallbacks

The backend worker should call src/lib/document-metadata-extractor.js after the raw file lands in object storage or worker-local scratch space. The fallback order is:

ExifTool for broad embedded metadata across images, audio, video, PDF, Office, XMP, IPTC, ID3, and other formats.
piexif through scripts/extract-piexif-metadata.py for optional JPEG/TIFF EXIF recovery. This package is unmaintained, so it is treated as a non-blocking supplement rather than a primary dependency.
Apache Tika when TIKA_APP_JAR is configured, especially for document containers and text-oriented metadata.
file/libmagic for MIME and file-signature detection.
Filename, extension, browser MIME type, size, and modified-date inference as the deterministic terminal fallback.

Each artifact can store metadata_sources, metadata_fallback_pathways, and raw_metadata so the UI can show what contributed to the editable metadata view.
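
The fallback order above can be sketched as an ordered chain of extractors. This example does not invoke ExifTool, piexif, Tika, or libmagic; callers pass in stand-in callables for those, and only the terminal filename inference is implemented. The per-field "earlier source wins" merge and the shapes of metadata_sources and metadata_fallback_pathways are assumptions.

```python
import mimetypes
from pathlib import Path
from typing import Callable, Optional

Extractor = Callable[[str], Optional[dict]]


def infer_from_filename(path: str) -> dict:
    """Deterministic terminal fallback based on the filename alone."""
    p = Path(path)
    mime, _encoding = mimetypes.guess_type(p.name)
    return {"title": p.stem, "extension": p.suffix.lower(),
            "mime_type": mime or "application/octet-stream"}


def run_fallback_chain(path: str, extractors: list) -> dict:
    """Run (name, extractor) pairs in order, then filename inference.

    A failing or empty extractor never blocks the chain; it is only
    recorded in the pathway log.
    """
    merged, sources, pathways = {}, {}, []
    chain = list(extractors) + [("filename_inference", infer_from_filename)]
    for name, extractor in chain:
        try:
            found = extractor(path) or {}
        except Exception as exc:  # a missing tool is treated as non-blocking
            pathways.append({"source": name, "outcome": "failed",
                             "detail": str(exc)})
            continue
        pathways.append({"source": name,
                         "outcome": "ok" if found else "empty"})
        for field_name, value in found.items():
            # Assumption: the first source to supply a field wins.
            if field_name not in merged and value not in (None, ""):
                merged[field_name] = value
                sources[field_name] = name
    return {"metadata": merged, "metadata_sources": sources,
            "metadata_fallback_pathways": pathways}
```

Recording the outcome of every step, including failures, is what lets the UI explain which pathway produced each editable value.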

Persistent Metadata Records

Editable file metadata and raw/original metadata should be persisted as separate database records so every save can be audited and compared. The prototype models this with documentMetadataRecords.

Each metadata record should include:

metadata_record_id
pk = DOCUMENT#{document_id}
sk = METADATA#{timestamp}#{record_type}
record_type, such as original, user_curated, or write_context
source
document ownership fields: document_id, project_id, account_id, team_id, owner_type, owner_id
storage fields: storage_provider, storage_key
editable metadata snapshot fields
original_metadata
metadata_sources
metadata_fallback_pathways
created_by
created_at
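
The pk and sk patterns above could be produced by a small helper. This is a sketch: the timestamp format (a fixed-width UTC ISO 8601 string, chosen so sort keys order chronologically), the id format, and the function name are assumptions.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional


def metadata_record_keys(document_id: str, record_type: str,
                         created_at: Optional[datetime] = None) -> dict:
    """Return the identifying fields for one metadata record.

    `created_at` must be timezone-aware; it defaults to the current UTC time.
    """
    moment = created_at or datetime.now(timezone.utc)
    if moment.tzinfo is None:
        raise ValueError("created_at must be timezone-aware")
    # Fixed-width UTC timestamps sort lexicographically in time order.
    ts = moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    return {
        "metadata_record_id": f"mr_{uuid.uuid4().hex}",
        "pk": f"DOCUMENT#{document_id}",
        "sk": f"METADATA#{ts}#{record_type}",
        "record_type": record_type,
        "created_at": ts,
    }
```

Because every save produces a new sort key under the same partition key, earlier snapshots remain available for audit and comparison.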

The “Call original” action in the UI simulates the backend calling the original metadata extraction pipeline and saving the returned file metadata as a database record. In production, that call should invoke the metadata worker, store raw metadata in the record, and preserve the edited artifact metadata separately.

Write Events

Users can write notes, mapping decisions, or metadata observations from the file repository or project file manager. Each captured write should create a documentWriteEvents record rather than silently mutating the artifact.

Each write event should include:

write_event_id
pk = DOCUMENT#{document_id}
sk = WRITE#{timestamp}#{write_event_id}
event_type
document_id, project_id, account_id, team_id
actor fields: actor_user_id, actor_email, actor_name
content
content_length
metadata_record_id
collected_context
created_at

The write-event handler should also create a write_context metadata record so the exact document metadata visible at the time of the user write can be reconstructed later.
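
A sketch of a write-event handler that captures the note and its write_context record together without mutating the artifact. The function name, id formats, the metadata_snapshot field, the "write_event" source value, and the shape of collected_context are hypothetical.

```python
import copy
import uuid
from datetime import datetime, timezone
from typing import Optional


def capture_write(document_id: str, artifact: dict, actor: dict, content: str,
                  event_type: str = "note",
                  now: Optional[datetime] = None) -> tuple:
    """Return (write_event, write_context_record); `artifact` is not modified."""
    moment = now or datetime.now(timezone.utc)
    ts = moment.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    ownership = {
        "document_id": document_id,
        "project_id": artifact.get("project_id"),
        "account_id": artifact.get("account_id"),
        "team_id": artifact.get("team_id"),
    }
    context_record = {
        "metadata_record_id": f"mr_{uuid.uuid4().hex}",
        "pk": f"DOCUMENT#{document_id}",
        "sk": f"METADATA#{ts}#write_context",
        "record_type": "write_context",
        "source": "write_event",
        **ownership,
        # Deep copy so later edits to the artifact cannot alter the snapshot.
        "metadata_snapshot": copy.deepcopy(artifact),
        "created_by": actor["actor_user_id"],
        "created_at": ts,
    }
    write_event_id = f"we_{uuid.uuid4().hex}"
    write_event = {
        "write_event_id": write_event_id,
        "pk": f"DOCUMENT#{document_id}",
        "sk": f"WRITE#{ts}#{write_event_id}",
        "event_type": event_type,
        **ownership,
        "actor_user_id": actor["actor_user_id"],
        "actor_email": actor.get("actor_email"),
        "actor_name": actor.get("actor_name"),
        "content": content,
        "content_length": len(content),
        "metadata_record_id": context_record["metadata_record_id"],
        "collected_context": {"title": artifact.get("title")},
        "created_at": ts,
    }
    return write_event, context_record
```

Linking the event to its context record through metadata_record_id is what allows the metadata visible at write time to be reconstructed later.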

The file manager supports editing file type, status, classification, source fields, concept tags, standard relevance, assigned project, project data-model elements, persistent metadata records, and write events. Moving a file to another project should update the artifact's project fields and its object/blob path, because the storage key embeds the project id.
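
A sketch of the move operation, assuming the file stays within the same account or workspace root and that the caller performs the actual object copy. The function name and the returned copy instruction are hypothetical.

```python
import re

# Matches both ownership roots described in the storage model.
_KEY_PATTERN = re.compile(
    r"^(?P<root>(?:accounts|workspaces)/[^/]+)"
    r"/projects/(?P<project>[^/]+)"
    r"/(?P<rest>documents/.+)$"
)


def move_to_project(artifact: dict, new_project_id: str) -> tuple:
    """Return (updated_artifact, copy_instruction) for a project move.

    The input dict is left unchanged; the storage backend still has to copy
    the object from the old key to the new one and delete the original.
    """
    match = _KEY_PATTERN.match(artifact["storage_key"])
    if match is None:
        raise ValueError("storage_key does not follow the ownership prefix")
    new_key = f"{match['root']}/projects/{new_project_id}/{match['rest']}"
    updated = {**artifact, "project_id": new_project_id,
               "storage_key": new_key}
    instruction = {"copy_from": artifact["storage_key"], "copy_to": new_key}
    return updated, instruction
```

Derived artifacts under processed/ share the same prefix, so a real move would apply the same rewrite to each of them.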

The UI prototype simulates this flow by creating artifact records in browser state and displaying the provider-specific object/blob paths that the backend should create.