File And Document Artifacts

External files are modeled internally as document artifacts. In practice they are user-uploaded files, imported source material, datasets, exports, notes, or references that can seed a project and then remain attached to it. This keeps upload, extraction, tagging, cross-linking, and review outputs close to the user-defined project entity that gives them meaning.

Document artifact storage follows the same account and project model in every runtime. Localhost writes file bytes under .wm-data/objects/; AWS writes the same bytes under the same provider-neutral keys in S3; Azure does the same in Blob Storage. The storage service changes, but the application account, project, document id, and permission rules do not.
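As a minimal sketch of that idea, the hypothetical helper below maps one provider-neutral key to a backend-specific location. The provider identifiers ("local", "aws", "azure") and the bucket, container, and storage-account names are placeholders chosen for illustration, not values taken from this project.

```javascript
// Hypothetical helper: resolve one provider-neutral key to a backend location.
// Provider ids and the bucket/container/account names are placeholders.
function resolveObjectLocation(provider, storageKey) {
  switch (provider) {
    case "local":
      // Localhost keeps file bytes under .wm-data/objects/
      return `.wm-data/objects/${storageKey}`;
    case "aws":
      return `s3://example-bucket/${storageKey}`;
    case "azure":
      return `https://exampleaccount.blob.core.windows.net/example-container/${storageKey}`;
    default:
      throw new Error(`Unknown storage provider: ${provider}`);
  }
}
```

Only the prefix in front of the key differs per backend, so permission checks can work on the key alone.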

The file experience has two surfaces:

The independent workspace does not remove project ownership. It adds a cross-workspace index so users can inspect how external files link to project scopes, source-code pages, requirements, metadata records, and write events without changing the project-centered permission model.

By-Account Storage Model

When a project belongs to an application account, raw and processed file artifacts are stored under that account's project prefix:

accounts/{account_id}/projects/{project_id}/documents/{document_id}/raw/{filename}
accounts/{account_id}/projects/{project_id}/documents/{document_id}/processed/{artifact_name}.json

When a project is private to a local or personal workspace, files are stored under a workspace prefix:

workspaces/{workspace_id}/projects/{project_id}/documents/{document_id}/raw/{filename}
workspaces/{workspace_id}/projects/{project_id}/documents/{document_id}/processed/{artifact_name}.json

The account id in the path is an application account id, not an AWS account id or Azure subscription id. It separates client, business, department, firm, or personal workspace data inside the application model. Production deployments can still choose stronger physical isolation, but the baseline model keeps account separation portable across local, AWS, and Azure backends.
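The two prefix layouts above can be produced by one small key builder. This is a sketch under assumptions: the function name, the `owner` shape, and the `kind` argument are hypothetical, and only the path templates come from this document.

```javascript
// Hypothetical helper that builds the provider-neutral key for a raw or
// processed artifact. The `owner` shape ({ type, id }) is an assumption.
function buildStorageKey({ owner, projectId, documentId, kind, name }) {
  let root;
  if (owner.type === "account") root = "accounts";
  else if (owner.type === "workspace") root = "workspaces";
  else throw new Error(`Unsupported owner type: ${owner.type}`);

  if (kind !== "raw" && kind !== "processed") {
    throw new Error(`Unsupported artifact kind: ${kind}`);
  }
  // Processed artifacts are stored as {artifact_name}.json
  const leaf = kind === "processed" ? `${name}.json` : name;
  return [root, owner.id, "projects", projectId, "documents", documentId, kind, leaf].join("/");
}
```

For example, an account-owned raw upload and a workspace-owned processed output would be keyed with `buildStorageKey({ owner: { type: "account", id: accountId }, ..., kind: "raw", name: filename })` and `buildStorageKey({ owner: { type: "workspace", id: workspaceId }, ..., kind: "processed", name: "chunks" })` respectively.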

Artifact Shape

Each artifact record should include:

storage_provider records the selected backend, while storage_key remains the provider-neutral object/blob path under the project ownership prefix. Project ownership should be the default because it gives a single place to resolve grants, activity, and billing. Account, team, and user-workspace ownership remain useful for templates, reference libraries, and reusable data objects, but a working document should normally resolve through its assigned project.

Tagging And Cross-Linking

The first tagging layer should support:

Standard relevance is separate from project_data_elements. Project data elements describe extracted requirements, standards, source pages, code pages, dependencies, stakeholders, assets, risks, references, and other project-specific concepts. Standard relevance records mark whether an external file is relevant to a versioned requirement framework, external policy, customer standard, source-code area, or domain-specific standard.

Each standard_relevance item should include standard_id, framework_id, organization, framework_version, standard_code, standard_label, crosswalk_group, relevance_type, confidence, and source. The initial prototype includes a domain-specific standards catalog, but the field is intentionally generic: new catalogs can describe internal engineering standards, release gates, customer requirements, compliance frameworks, operational procedures, or other versioned source-of-truth systems.
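A completeness check for these items might look like the sketch below. The field names are the ones listed above; the function name and the rule that every field must be present and non-null are assumptions made for illustration.

```javascript
// Field names come from the paragraph above; the validation rule is an assumption.
const STANDARD_RELEVANCE_FIELDS = [
  "standard_id", "framework_id", "organization", "framework_version",
  "standard_code", "standard_label", "crosswalk_group",
  "relevance_type", "confidence", "source",
];

// Returns the names of fields that are missing (undefined or null) from an item.
function missingStandardRelevanceFields(item) {
  return STANDARD_RELEVANCE_FIELDS.filter(
    (field) => item[field] === undefined || item[field] === null
  );
}
```

An empty result means the item carries every expected field; the check deliberately says nothing about which values are valid, so new catalogs can be added without touching it.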

Recommended relevance_type values are:

Derived extraction outputs should be artifacts too. For example, chunks.json, tables.json, figures.json, and document_metadata.json should retain a parent link to the raw uploaded document.
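One way to keep that parent link is to derive the processed records from the raw artifact, as in this sketch. The field names `artifact_kind` and `parent_document_id` are hypothetical; the output file names and the raw/processed folder layout follow this document.

```javascript
// Hypothetical sketch: derive processed-artifact records that point back to
// the raw upload. `artifact_kind` and `parent_document_id` are assumed names.
const DERIVED_OUTPUTS = ["chunks", "tables", "figures", "document_metadata"];

function deriveProcessedArtifacts(rawArtifact) {
  // The raw key ends in /raw/{filename}; processed outputs live in a sibling folder.
  const cut = rawArtifact.storage_key.lastIndexOf("/raw/");
  if (cut === -1) throw new Error("Expected a raw artifact key");
  const documentPrefix = rawArtifact.storage_key.slice(0, cut);

  return DERIVED_OUTPUTS.map((name) => ({
    artifact_kind: "processed",
    parent_document_id: rawArtifact.document_id,
    storage_provider: rawArtifact.storage_provider,
    storage_key: `${documentPrefix}/processed/${name}.json`,
  }));
}
```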

Project Creation Flow

  1. User opens the project intake modal.
  2. User uploads the external files that define the project.
  3. Client or API extracts initial metadata from filenames, MIME type, and file attributes, and later from document contents.
  4. User reviews and overrides file title, type, source date, classification, project role, tags, related source/code references, and data-model elements.
  5. User creates the project workspace from the reviewed file assemblage.
  6. API creates the project, owner grant, artifact records, and presigned upload URLs.
  7. Client uploads raw files directly to the selected object store, such as S3 or Azure Blob Storage.
  8. Extraction workers create processed artifacts and update metadata confidence, concept tags, and project data elements.
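Step 6 of the flow above can be sketched as a single function. Everything here is illustrative: the record fields, the "owner" role value, and the injected `createId` and `signUploadUrl` helpers are assumptions standing in for an id generator and a provider-specific URL signer, and the sketch only covers account-owned projects.

```javascript
// Hypothetical sketch of step 6. `createId` and `signUploadUrl` are injected
// stand-ins for an id generator and a provider-specific URL signer.
function createProjectFromIntake({ accountId, userId, name, files }, { createId, signUploadUrl }) {
  const project = { project_id: createId(), account_id: accountId, name };
  const ownerGrant = { project_id: project.project_id, user_id: userId, role: "owner" };

  const artifacts = files.map((file) => {
    const documentId = createId();
    const storageKey =
      `accounts/${accountId}/projects/${project.project_id}` +
      `/documents/${documentId}/raw/${file.filename}`;
    return {
      document_id: documentId,
      project_id: project.project_id,
      storage_key: storageKey,
      // The client uses this URL in step 7 to upload bytes directly.
      upload_url: signUploadUrl(storageKey),
    };
  });

  return { project, ownerGrant, artifacts };
}
```

Injecting the signer keeps the function independent of the selected object store, so the same flow can run against local, AWS, or Azure backends.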

Metadata Extraction Fallbacks

The backend worker should call src/lib/document-metadata-extractor.js after the raw file lands in object storage or worker-local scratch space. The fallback order is:

  1. ExifTool for broad embedded metadata across images, audio, video, PDF, Office, XMP, IPTC, ID3, and other formats.
  2. piexif through scripts/extract-piexif-metadata.py for optional JPEG/TIFF EXIF recovery. This package is unmaintained, so it is treated as a non-blocking supplement rather than a primary dependency.
  3. Apache Tika when TIKA_APP_JAR is configured, especially for document containers and text-oriented metadata.
  4. file/libmagic for MIME and file-signature detection.
  5. Filename, extension, browser MIME type, size, and modified-date inference as the deterministic terminal fallback.

Each artifact can store metadata_sources, metadata_fallback_pathways, and raw_metadata so the UI can show what contributed to the editable metadata view.
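The fallback order and the three bookkeeping fields can be sketched as a loop over extractor functions. This is not the contents of src/lib/document-metadata-extractor.js: the extractor objects are hypothetical stand-ins for the tools listed above, the status strings are invented, and the merge rule (the earliest extractor to report a field wins) is an assumption. The sketch is synchronous for brevity; real extractors that shell out to external tools would likely be asynchronous.

```javascript
// Hypothetical sketch: run extractors in priority order and record what
// each contributed. Earlier extractors win when two report the same field.
function extractWithFallbacks(file, extractors) {
  const metadata = {};
  const metadata_sources = {};           // field -> extractor that supplied it
  const metadata_fallback_pathways = []; // ordered log of attempts
  const raw_metadata = {};               // extractor -> unmodified output

  for (const { name, run } of extractors) {
    try {
      const result = run(file);
      if (!result) {
        // Extractor is not configured or returned nothing for this file.
        metadata_fallback_pathways.push({ name, status: "skipped" });
        continue;
      }
      raw_metadata[name] = result;
      for (const [field, value] of Object.entries(result)) {
        if (!Object.hasOwn(metadata, field)) {
          metadata[field] = value;
          metadata_sources[field] = name;
        }
      }
      metadata_fallback_pathways.push({ name, status: "ok" });
    } catch (error) {
      // A failing tool (for example a missing binary) must not block later fallbacks.
      metadata_fallback_pathways.push({ name, status: "failed", error: String(error.message) });
    }
  }
  return { metadata, metadata_sources, metadata_fallback_pathways, raw_metadata };
}
```

Because every attempt is logged, the UI can show both which source supplied each editable field and which tools were tried without success.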

Persistent Metadata Records

Editable file metadata and raw/original metadata should be persisted as separate database records so every save can be audited and compared. The prototype models this with documentMetadataRecords.

Each metadata record should include:

The “Call original” action in the UI simulates the backend calling the original metadata extraction pipeline and saving the returned file metadata as a database record. In production, that call should invoke the metadata worker, store raw metadata in the record, and preserve the edited artifact metadata separately.
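Keeping original and edited metadata as separate records could be sketched as an append-only list plus a comparison helper. The field names (`record_id`, `record_type`, `created_at`) and the "original" and "edited" values are assumptions for illustration and are not taken from the prototype's documentMetadataRecords schema.

```javascript
// Hypothetical sketch of append-only metadata records. Field names such as
// record_type and created_at are assumptions, not the prototype's schema.
function appendMetadataRecord(records, { documentId, recordType, metadata, now = new Date() }) {
  const record = {
    record_id: records.length + 1,
    document_id: documentId,
    record_type: recordType, // for example "original" or "edited"
    metadata: { ...metadata },
    created_at: now.toISOString(),
  };
  // Return a new array so earlier records are never mutated.
  return [...records, record];
}

// Lists the fields whose values differ between two records.
function diffMetadataRecords(before, after) {
  const fields = new Set([...Object.keys(before.metadata), ...Object.keys(after.metadata)]);
  return [...fields].filter((field) => before.metadata[field] !== after.metadata[field]);
}
```

Saving the extractor output as one record and each user edit as another lets any two saves be compared with `diffMetadataRecords`.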

Write Events

Users can write notes, mapping decisions, or metadata observations from the file repository or project file manager. Each captured write should create a documentWriteEvents record rather than silently mutating the artifact.

Each write event should include:

The write-event handler should also create a write_context metadata record so the exact document metadata visible at the time of the user write can be reconstructed later.
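A write-event handler along these lines might look like the sketch below. The state shape, id formats, and every field name other than documentWriteEvents, documentMetadataRecords, and write_context are hypothetical.

```javascript
// Hypothetical sketch: capture a user write as an event plus a write_context
// metadata record, leaving the artifact itself untouched.
function captureWrite(state, { artifact, userId, body, now = new Date() }) {
  const timestamp = now.toISOString();
  const contextRecord = {
    record_id: `meta-${state.documentMetadataRecords.length + 1}`,
    document_id: artifact.document_id,
    record_type: "write_context",
    // Copy the metadata visible at write time (assumes JSON-safe values).
    metadata: JSON.parse(JSON.stringify(artifact.metadata)),
    created_at: timestamp,
  };
  const writeEvent = {
    event_id: `write-${state.documentWriteEvents.length + 1}`,
    document_id: artifact.document_id,
    user_id: userId,
    body,
    context_record_id: contextRecord.record_id,
    created_at: timestamp,
  };
  return {
    documentMetadataRecords: [...state.documentMetadataRecords, contextRecord],
    documentWriteEvents: [...state.documentWriteEvents, writeEvent],
  };
}
```

Because the context record holds a copy rather than a reference, later edits to the artifact do not change what the write event points to.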

The file manager supports editing file type, status, classification, source fields, concept tags, standard relevance, assigned project, project data-model elements, persistent metadata records, and write events. Moving a file to another project should update the artifact's project fields and object/blob path.
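A move between projects under the same owner could be sketched as follows. The `project_id` and `previous_storage_key` field names are assumptions, and the sketch only rewrites the record; copying the bytes to the new key remains a backend task.

```javascript
// Hypothetical sketch: re-home an artifact under another project by updating
// its project field and rewriting the project segment of the storage key.
function moveArtifactToProject(artifact, targetProjectId) {
  const segments = artifact.storage_key.split("/");
  // Keys look like {accounts|workspaces}/{owner_id}/projects/{project_id}/documents/...
  if (segments[2] !== "projects" || segments[4] !== "documents") {
    throw new Error(`Unexpected storage key layout: ${artifact.storage_key}`);
  }
  segments[3] = targetProjectId;
  return {
    ...artifact,
    project_id: targetProjectId,
    previous_storage_key: artifact.storage_key, // the backend still has to copy the bytes
    storage_key: segments.join("/"),
  };
}
```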

The UI prototype simulates this flow by creating artifact records in browser state and displaying the provider-specific object/blob paths that the backend should create.