High-Level Goal
The primary goal is to offload heavy, time-consuming document processing from your main Laravel application so the user experience stays fast and the system remains scalable and resilient. The Laravel app acts as the orchestrator and status monitor, not the worker.

Recommended Architecture: Asynchronous & Event-Driven
The optimal approach is a decoupled, event-driven architecture built around a message queue. This prevents your Laravel app from being blocked and lets each part of the system scale independently.

Visual Flow Diagram
Detailed Step-by-Step Flow
Step 1: Document Upload and Job Dispatch (Laravel App)
- User Uploads PDF: A user uploads a PDF document through your Laravel application’s front end.
- Initial Record Creation: Laravel creates a record in its own database (e.g., a `documents` table). This record should have a `status` column, initialized to `'pending'`.
  - `documents` table: `id`, `user_id`, `original_filename`, `storage_path`, `status` (`pending`, `processing`, `creating_embeddings`, `building_graph`, `completed`, `failed`), `error_message`.
- Store the File: Laravel does not store the file locally. It immediately uploads the PDF to a shared, cloud-based file storage service like Amazon S3, Google Cloud Storage, or Supabase Storage. This ensures the file is accessible to other services.
- Dispatch a Job: Laravel pushes a message to a Message Queue (such as RabbitMQ, Amazon SQS, or even Redis). This is the key to making the process asynchronous. The message payload is simple and contains only the information a worker needs, such as the document ID and the file path.
- Immediate User Feedback: The Laravel app immediately returns a response to the user, saying “Your document has been uploaded and is now being processed.” The UI can now poll for status updates or use websockets.
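The queue message only needs enough information for a worker to locate the document and report back. A minimal sketch of building that payload (the field names `document_id`, `file_path`, and `bucket` are illustrative assumptions, not a fixed contract):

```python
import json


def build_job_message(document_id: int, storage_path: str, bucket: str) -> str:
    """Serialize the minimal payload a worker needs to process one document.

    Field names are illustrative; adapt them to whatever contract your
    Laravel app and worker service agree on.
    """
    payload = {
        "document_id": document_id,  # primary key of the documents row
        "file_path": storage_path,   # object key in S3/GCS/Supabase Storage
        "bucket": bucket,            # which bucket the object lives in
    }
    return json.dumps(payload)


# Example: the string you would push onto RabbitMQ/SQS/Redis
message = build_job_message(123, "uploads/2024/report.pdf", "documents")
```

Keeping the payload this small means the queue never carries file contents, only a pointer into shared storage.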
Step 2: The Processing Pipeline (A Separate Service)
This is a dedicated, stateless worker service. It could be written in Python, Node.js, or Go, whichever is best for data processing and interacting with the various APIs. Its only job is to listen to the message queue and process documents.
- Consume the Message: A worker from your pipeline service picks up the job message from the queue.
- Update Status: The very first thing the worker does is make an API call back to your Laravel application to update the document’s status. This provides real-time feedback.
  - Example: `PUT /api/documents/123` with payload `{"status": "processing"}`.
  - Your Laravel app should have a secure internal API endpoint for this.
- Fetch the Document: The worker uses the `file_path` from the message to download the PDF from the shared file storage (S3).
- Call Docling API: The worker sends the PDF to your `docling` API server for processing. It waits for `docling` to return the structured data (e.g., text chunks, entities, relationships in a JSON format).
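The status callback is just an authenticated PUT to the Laravel app's internal API. A sketch of how the worker might build that request (the base URL and endpoint shape are assumptions; the actual HTTP call, e.g. via `urllib.request` or `requests`, is omitted):

```python
import json
from typing import Optional, Tuple

LARAVEL_BASE_URL = "https://app.example.com"  # assumed; configure via env


def build_status_update(document_id: int, status: str,
                        error_message: Optional[str] = None) -> Tuple[str, str]:
    """Return the (url, json_body) pair for a PUT to the internal status API.

    Only shows the request contract; sending it and attaching an auth
    token/signature is left to your HTTP client of choice.
    """
    url = f"{LARAVEL_BASE_URL}/api/documents/{document_id}"
    body = {"status": status}
    if error_message is not None:
        body["error_message"] = error_message
    return url, json.dumps(body)


url, body = build_status_update(123, "processing")
```

Centralizing this in one helper keeps every step of the pipeline reporting status through the same contract.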
Step 3: Vector Embedding & Storage (In the Worker)
- Update Status: Worker notifies Laravel: `PUT /api/documents/123` with `{"status": "creating_embeddings"}`.
- Generate Embeddings: The worker takes the text chunks from the `docling` response. For each chunk, it calls an embedding model (e.g., OpenAI’s API, a self-hosted sentence-transformer model) to get a vector.
- Save to Supabase/Postgres:
- Optimal Method: Direct Database Connection.
- Why? Performance. You will be inserting potentially hundreds or thousands of vector rows per document. Using an API (like Supabase Edge Functions or PostgREST) for this would be very slow due to the overhead of one HTTP request per insert.
- How: The worker service connects directly to your Supabase Postgres database using standard Postgres credentials. It should perform a bulk `INSERT` operation to save all the text chunks and their corresponding vectors in a single, efficient database transaction.
- Security: Ensure the worker service is in a trusted environment (e.g., a VPC) and connects to the database over SSL. Store database credentials securely (e.g., AWS Secrets Manager, Doppler, or environment variables).
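A bulk insert can be expressed as one parameterized multi-row `INSERT` executed in a single transaction. A sketch of building that statement (the table and column names are assumptions; with a driver like psycopg2 you would pass `params` to `cursor.execute` rather than interpolating values into the SQL):

```python
from typing import List, Tuple


def build_bulk_insert(rows: List[Tuple[int, str, List[float]]]) -> Tuple[str, list]:
    """Build one parameterized multi-row INSERT for embedded text chunks.

    Each row is (document_id, chunk_text, embedding). The table name
    `document_chunks` and its columns are illustrative placeholders.
    """
    placeholders = ", ".join(["(%s, %s, %s)"] * len(rows))
    sql = (
        "INSERT INTO document_chunks (document_id, content, embedding) "
        f"VALUES {placeholders}"
    )
    # Flatten the rows into the positional parameter list the driver expects.
    params = [value for row in rows for value in row]
    return sql, params


sql, params = build_bulk_insert([
    (123, "First chunk of text...", [0.1, 0.2]),
    (123, "Second chunk of text...", [0.3, 0.4]),
])
# One round trip inserts all rows; wrap it in BEGIN/COMMIT on the connection.
```

Compared with one HTTP request per chunk, a single statement like this turns hundreds of round trips into one.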
Step 4: Knowledge Graph Creation (In the Worker)
- Update Status: Worker notifies Laravel: `PUT /api/documents/123` with `{"status": "building_graph"}`.
- Parse KG Data: The worker parses the entities and relationships from the `docling` JSON response.
- Save to Neo4j:
- Optimal Method: Direct Connection via Bolt Protocol.
- Why? This is the native, standard, and most performant way to interact with Neo4j. An API layer in front of Neo4j for this kind of data ingestion would add unnecessary complexity and latency.
- How: The worker uses an official Neo4j driver (e.g., for Python, JavaScript). It constructs Cypher queries to create the graph. It’s crucial to use `MERGE` instead of `CREATE` to avoid creating duplicate nodes for the same entity across different documents. The entire operation for one document should be wrapped in a single transaction.
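Keying `MERGE` on a stable entity property is what deduplicates nodes across documents. A sketch of the parameterized Cypher a worker might run per extracted triple (the `Entity`/`RELATES` labels and property names are assumptions; with the official `neo4j` driver the query would run inside `session.execute_write`):

```python
# Parameterized Cypher for one (entity)-[relationship]->(entity) triple.
# MERGE matches an existing node/edge or creates it, so re-processing a
# document (or processing a second document mentioning the same entity)
# never produces duplicates.
CYPHER = """
MERGE (a:Entity {name: $source})
MERGE (b:Entity {name: $target})
MERGE (a)-[r:RELATES {type: $rel_type}]->(b)
SET r.document_id = $document_id
"""


def triple_params(source: str, rel_type: str, target: str,
                  document_id: int) -> dict:
    """Map one extracted (source, relation, target) triple to query params."""
    return {
        "source": source,
        "target": target,
        "rel_type": rel_type,
        "document_id": document_id,
    }


params = triple_params("Acme Corp", "ACQUIRED", "Widget Inc", 123)
```

Running all triples for one document inside a single write transaction keeps the graph consistent if any statement fails partway through.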
Step 5: Final Notification
- On Success: If all steps complete successfully, the worker makes one final API call to Laravel.
  - `PUT /api/documents/123` with `{"status": "completed"}`.
- On Failure: If any step fails (Docling API error, database connection issue, etc.), the worker should catch the exception and notify Laravel.
  - `PUT /api/documents/123` with `{"status": "failed", "error_message": "Failed to generate embeddings: ..."}`.
- Your pipeline should also have a retry mechanism and a dead-letter queue for jobs that fail repeatedly.
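The retry/dead-letter policy can be as simple as a bounded attempt counter with exponential backoff. A sketch of the decision logic (the attempt budget and backoff values are assumed, not prescribed by the source):

```python
MAX_ATTEMPTS = 3  # assumed policy: after 3 failures, dead-letter the job


def next_action(attempt: int) -> tuple:
    """Decide what to do with a failed job on its Nth attempt (1-based).

    Returns ("retry", delay_seconds) with exponential backoff while the
    attempt budget lasts, or ("dead_letter", None) once it is exhausted.
    """
    if attempt < MAX_ATTEMPTS:
        return ("retry", 2 ** attempt)  # 2s, 4s, ... between retries
    return ("dead_letter", None)
```

Jobs that land in the dead-letter queue keep their full payload, so an operator can inspect the failure and requeue them after fixing the underlying problem.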
Summary of Optimal Methods
| Task | Optimal Method | Rationale |
|---|---|---|
| Triggering the Process | Message Queue (RabbitMQ, SQS) | Decoupling & Scalability. Laravel doesn’t wait. The queue absorbs spikes in uploads. You can scale the number of worker services independently of the web app. |
| Notifying Laravel of Status | Webhook/Callback API (Worker calls a Laravel API endpoint) | Efficiency & Real-time Updates. It’s a “push” model. The worker actively informs Laravel of progress. This is far better than Laravel constantly “pulling” (polling) the worker for its status. |
| Saving Vectors to Supabase | Direct Database Connection (with bulk inserts) | Performance. Massively faster for bulk data ingestion than making hundreds of individual API calls. The worker is a trusted backend service, making a direct connection appropriate. |
| Saving Graph to Neo4j | Direct Connection via Bolt Driver (with transactional Cypher queries) | Performance & Native Integration. This is the standard, most efficient way to interact with Neo4j, designed for high-performance graph operations from backend services. |
| Sharing the PDF File | Shared Object Storage (S3, GCS, Supabase Storage) | Accessibility & Statelessness. All services (Laravel, Worker) can access the file via a common, scalable storage layer without needing a shared file system. |
