Automating Data Archival from Neo4j to MongoDB

In today’s data-driven world, organisations constantly face the challenge of managing large datasets efficiently while ensuring optimal performance and long-term data storage. OTW, a leading organisation, ran into exactly this problem: the growing volume of data in their Neo4j database began to degrade performance. To overcome this, they sought a scalable and reliable solution that would automate the archival of older data into MongoDB, ensuring seamless storage management while maintaining data integrity.

Problem Statement

OTW’s Neo4j database was facing significant performance bottlenecks due to an ever-growing dataset. The organisation required a solution that would archive data older than 45 days without compromising the database's integrity. This archival solution had to offload older data efficiently into MongoDB, freeing up space in Neo4j and optimising performance. Additionally, an automatic purge mechanism was needed in Neo4j so that only a rolling 45-day window of data was retained.

Solution

To address the challenges, a custom automated archival solution was developed using Python. This solution dynamically fetched configurations from an S3 bucket, securely retrieved credentials from Vault, and enabled batch processing to archive large volumes of data from Neo4j to MongoDB. The solution also implemented an automatic purge of data from Neo4j post-archival to ensure compliance with the rolling retention policy.
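
At a high level, one run of the job ties these pieces together roughly as sketched below. Every name here is a hypothetical stand-in rather than OTW’s actual code; each step is sketched more concretely under the corresponding feature in the list that follows.

    # Hypothetical outline of one archival run; each step is stubbed here
    # and illustrated concretely under the matching feature below.

    def fetch_config_from_s3() -> dict: ...              # feature 01
    def fetch_credentials_from_vault() -> dict: ...      # feature 02
    def archive_one_batch(cfg, creds) -> list: ...       # feature 03
    def purge_from_neo4j(ids, cfg, creds) -> None: ...   # feature 04

    def run_archival() -> None:
        config = fetch_config_from_s3()
        creds = fetch_credentials_from_vault()
        while True:
            ids = archive_one_batch(config, creds) or []
            if not ids:
                break  # nothing older than the retention window remains
            purge_from_neo4j(ids, config, creds)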

Feature List

01
Dynamic Configuration Retrieval from S3

Automated retrieval of configurations ensures that the solution adapts to changing needs without manual intervention.
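
As a rough illustration, the snippet below sketches how such a lookup might work with boto3; the bucket name, object key, and config fields are assumptions, not OTW’s actual values.

    import json

    import boto3

    def fetch_config(bucket: str, key: str) -> dict:
        """Download and parse a JSON configuration object from S3."""
        s3 = boto3.client("s3")
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        return json.loads(body)

    # Illustrative config shape (field names are assumptions):
    # {"retention_days": 45, "batch_size": 5000,
    #  "neo4j_uri": "...", "mongo_uri": "...", "mongo_db": "archive"}
    config = fetch_config("example-config-bucket", "archiver/config.json")
    retention_days = config.get("retention_days", 45)
    batch_size = config.get("batch_size", 5000)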

02
Secure Credential Management using Vault

Sensitive credentials required for accessing databases are securely stored and retrieved using Vault, ensuring data security.
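
A minimal sketch with the hvac client is shown below. The secret path and field names are hypothetical, and the Vault address and token are assumed to be injected through the environment rather than hard-coded.

    import os

    import hvac

    client = hvac.Client(
        url=os.environ["VAULT_ADDR"],     # e.g. https://vault.example.com
        token=os.environ["VAULT_TOKEN"],  # injected at runtime, never committed
    )

    # KV v2 read: the secret payload sits under data -> data.
    secret = client.secrets.kv.v2.read_secret_version(path="archiver/databases")
    creds = secret["data"]["data"]

    neo4j_auth = (creds["neo4j_user"], creds["neo4j_password"])
    mongo_uri = creds["mongo_uri"]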

03
Batch Processing for Scalable Data Archival

This allows for efficient handling of large datasets, making the archival process scalable.
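
The single-batch sketch below illustrates the pattern under assumed names: an :Event label, a createdAt timestamp stored as an ISO-8601 string, and a batch size of 5,000. The actual Cypher was supplied by the Neo4j team (see Team & Support).

    from datetime import datetime, timedelta, timezone

    from neo4j import GraphDatabase
    from pymongo import MongoClient

    cutoff = (datetime.now(timezone.utc) - timedelta(days=45)).isoformat()

    driver = GraphDatabase.driver("bolt://neo4j.example:7687",
                                  auth=("neo4j", "change-me"))
    events = MongoClient("mongodb://mongo.example:27017")["archive"]["events"]

    # Fetch one batch of expired nodes. elementId() requires Neo4j 5;
    # on Neo4j 4, id() would be used instead.
    FETCH = """
    MATCH (e:Event)
    WHERE e.createdAt < $cutoff
    RETURN elementId(e) AS id, properties(e) AS props
    LIMIT $batch_size
    """

    with driver.session() as session:
        rows = session.run(FETCH, cutoff=cutoff, batch_size=5000).data()

    if rows:
        # Keep the Neo4j element id with each document so the purge step
        # can delete exactly the records that were archived.
        events.insert_many([{"_neo4j_id": r["id"], **r["props"]} for r in rows])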

04
Automatic Data Deletion Post Archival

Ensures that only necessary data is retained in Neo4j, preventing unnecessary storage consumption.
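
A hedged sketch of that purge follows: the nodes just copied to MongoDB are detach-deleted in small chunks so each transaction stays short. The ids are assumed to come from the archival step above.

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://neo4j.example:7687",
                                  auth=("neo4j", "change-me"))

    # DETACH DELETE removes each node together with its relationships.
    PURGE = """
    MATCH (e:Event)
    WHERE elementId(e) IN $ids
    DETACH DELETE e
    """

    def purge_archived(ids: list[str], chunk_size: int = 1000) -> None:
        """Delete archived nodes in small transactions to limit lock time."""
        with driver.session() as session:
            for start in range(0, len(ids), chunk_size):
                batch = ids[start:start + chunk_size]
                session.execute_write(
                    lambda tx, b=batch: tx.run(PURGE, ids=b).consume()
                )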

05
Robust Error Handling and Detailed Logging

Comprehensive error handling and logging provide insights and allow for quick resolution of issues.
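
The sketch below shows the general shape of that handling, assuming a per-batch run loop; run_batch is a hypothetical placeholder for one archive-and-purge pass.

    import logging
    import sys

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )
    log = logging.getLogger("archiver")

    def run_batch(batch_no: int) -> int:
        """Hypothetical placeholder for one archive-and-purge pass;
        returns the number of records moved (0 when nothing is left)."""
        raise NotImplementedError

    def main() -> None:
        batch_no = total = 0
        try:
            while True:
                moved = run_batch(batch_no)
                if moved == 0:
                    break
                total += moved
                log.info("batch %d archived %d records", batch_no, moved)
                batch_no += 1
        except Exception:
            # Log the full traceback with batch context, then exit non-zero
            # so the scheduler marks the run as failed.
            log.exception("archival failed at batch %d", batch_no)
            sys.exit(1)
        log.info("run complete: %d records archived", total)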

06
Automated Deployment Using GitLab CI and Docker

CI/CD pipelines streamline deployment, ensuring rapid and reliable updates.

07
Scheduled Execution Using Kubernetes CronJobs

Kubernetes CronJobs enable the automated, periodic execution of the archival process, ensuring regular data management.

08
Customisable Retention and Runtime Configurations

Provides flexibility to adapt to different storage and retention needs.

Tech and Solution Stack

Languages

Python was the primary language used for scripting and automation.

Databases

Neo4j served as the source database, while MongoDB was used as the archival destination.

Configuration Management

The solution dynamically retrieved configurations from an S3 bucket.

Credential Management

Vault provided a secure mechanism for credential management.

Orchestration

Kubernetes was used for orchestrating the deployment across production and non-production namespaces.

CI/CD

GitLab CI was employed for continuous integration and deployment, with Docker images stored in Harbor for version control.

Hosting

The archival solution is hosted within Kubernetes clusters, ensuring scalability and resilience. OTW maintains separate namespaces for production and non-production environments, minimising interference and enabling thorough testing.

Team & Support

This project required close collaboration with the Neo4j team, who provided essential Cypher queries and valuable insights on efficient data handling. Their expertise in optimising large datasets shaped the batch-processing logic and ensured the archival process did not impact Neo4j’s performance.

Maintenance

  • Periodic Updates to Dynamic Configurations: Ensuring configurations in S3 are always up to date to reflect evolving business needs.
  • Regular Monitoring of Logs and Alerts: Continuous monitoring helps detect issues early and resolve them swiftly.
  • Database Performance Reviews: Routine assessments ensure that the archival solution aligns with retention policies and performance benchmarks.
  • Updating Docker Images and Scripts: Keeping Docker images and scripts current for compatibility with changing infrastructure.

Conclusion

The automated archival solution at OTW has successfully streamlined the management of large datasets. It has significantly improved Neo4j's performance by ensuring older data is efficiently offloaded into MongoDB while maintaining integrity and ease of use. This solution not only addresses immediate storage challenges but also provides a flexible framework for future scalability and evolving data management needs.