AWS – Managing Amazon Neptune graph databases


Amazon Neptune is a graph database built specifically to store and query data made up of a large number of interconnected records. This use case may not be familiar to everyone, so we will start with a basic description of a graph database.

Let’s say you are building a social networking application, where users can friend each other and comment on each other’s posts. You will end up with data structures that quickly exceed what relational databases were designed to handle. You may have users with millions of followers, and you may want to quickly traverse the relationship graph to find followers whose interests match the content of your popular users’ latest updates. A purpose-built graph database will give you the best performance in this kind of scenario.

The objects in your graph (entities such as users) are referred to as vertices. A vertex is somewhat like a row in a relational database table or an item in a DynamoDB table, but it is enhanced by its relationships to other vertices. These relationships are known as edges. You can think of edges as the lines you draw between vertices in a diagram:

Vertices and edges

Imagine the preceding diagram with billions of interconnected vertices. Amazon Neptune can handle queries across such a graph in milliseconds, and it allows for the creation of multiple read replicas to support high-volume applications.
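
To make the vertex and edge terminology concrete, here is a minimal in-memory sketch of the social networking example. It is not how Neptune stores or queries data. The `Graph` class, the `follows` edge label, and the `interests` property are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


class Graph:
    """Tiny in-memory graph: vertices carry properties, edges are labeled and directed."""

    def __init__(self):
        self.vertices = {}                  # vertex id -> property dict
        self.in_edges = defaultdict(list)   # target id -> [(label, source id)]
        self.out_edges = defaultdict(list)  # source id -> [(label, target id)]

    def add_vertex(self, vertex_id, **properties):
        self.vertices[vertex_id] = properties

    def add_edge(self, source, label, target):
        self.out_edges[source].append((label, target))
        self.in_edges[target].append((label, source))

    def followers_interested_in(self, user_id, topic):
        """Walk incoming 'follows' edges and keep followers whose interests include the topic."""
        return sorted(
            source
            for label, source in self.in_edges[user_id]
            if label == "follows"
            and topic in self.vertices[source].get("interests", ())
        )


graph = Graph()
graph.add_vertex("alice", interests={"hiking", "databases"})
graph.add_vertex("bob", interests={"databases"})
graph.add_vertex("carol", interests={"cooking"})
graph.add_vertex("dave", interests={"hiking"})
graph.add_edge("bob", "follows", "alice")
graph.add_edge("carol", "follows", "alice")
graph.add_edge("dave", "follows", "alice")
graph.add_edge("alice", "follows", "bob")

matches = graph.followers_interested_in("alice", "databases")
print(matches)  # ['bob']
```

The query starts at one vertex and follows edges outward instead of joining whole tables, which is the access pattern a graph database is built to serve.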

How to do it…

In this recipe, you will create a Neptune cluster, back it up, and restore it:

  1. Log in to your AWS account and go to the Neptune dashboard.
  2. Click Launch Amazon Neptune.
  3. In the database creation screen, select the default DB engine version and choose db.r4.large as the DB instance class:

Specify DB details
  4. In the Settings section, give the cluster a unique identifier and click Next:

DB instance identifier
  5. On the following screen for advanced settings, accept the defaults and click Create database. On the next screen, you can wait for cluster creation to complete:

Creating a DB instance
  6. Loading data into the cluster and querying that data is outside the scope of this recipe. Check out the extensive documentation on the AWS website for instructions on how to load and query data.
  7. Go to the left-hand menu and choose Snapshots. Click Take snapshot:

Take snapshot
  8. On the next screen, choose the database instance and give the snapshot a name. Click Take snapshot.
  9. Once the snapshot becomes available, select Restore snapshot from the Actions menu. Note that you also have the option to share this snapshot with another account, which can be the basis of a sound disaster recovery plan for your application.
  10. On the following screen, you will configure a new cluster, just as if you were creating one from scratch. Restoring a snapshot always creates a new cluster; it does not overwrite an existing database.
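
The console steps above can also be scripted. The following is a rough sketch, assuming a boto3 Neptune client (`boto3.client("neptune")`) is passed in as `client`. The function names, the identifiers, and the `-instance-1` suffix are hypothetical. The sketch does not wait for the cluster or the snapshot to become available, which a real script would need to do between steps. It also assumes that a cluster restored through the API starts without instances, so one is added explicitly.

```python
def create_cluster(client, cluster_id, instance_class="db.r4.large"):
    """Create a Neptune cluster and a primary instance inside it."""
    client.create_db_cluster(DBClusterIdentifier=cluster_id, Engine="neptune")
    client.create_db_instance(
        DBInstanceIdentifier=f"{cluster_id}-instance-1",  # hypothetical naming scheme
        DBInstanceClass=instance_class,
        Engine="neptune",
        DBClusterIdentifier=cluster_id,
    )


def take_snapshot(client, cluster_id, snapshot_id):
    """Request a manual snapshot of the cluster."""
    client.create_db_cluster_snapshot(
        DBClusterSnapshotIdentifier=snapshot_id,
        DBClusterIdentifier=cluster_id,
    )


def restore_snapshot(client, snapshot_id, new_cluster_id, instance_class="db.r4.large"):
    """Restore the snapshot into a NEW cluster; the source cluster is untouched."""
    client.restore_db_cluster_from_snapshot(
        DBClusterIdentifier=new_cluster_id,
        SnapshotIdentifier=snapshot_id,
        Engine="neptune",
    )
    # Assumption: the restored cluster needs an instance before it can serve queries.
    client.create_db_instance(
        DBInstanceIdentifier=f"{new_cluster_id}-instance-1",
        DBInstanceClass=instance_class,
        Engine="neptune",
        DBClusterIdentifier=new_cluster_id,
    )


# Usage with real AWS credentials (not executed here):
#   import boto3
#   client = boto3.client("neptune")
#   create_cluster(client, "my-graph-cluster")
#   take_snapshot(client, "my-graph-cluster", "my-graph-snapshot")
#   restore_snapshot(client, "my-graph-snapshot", "my-graph-restored")
```

Passing the client in as a parameter keeps the functions easy to exercise with a stub before pointing them at a real account.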

How it works…

When you create a Neptune cluster to host your graph data, several resources are created and managed for you:

  • The primary database instance: This database is available for both reads and writes.
  • Read replicas: You can create up to 15 read replicas. As the name implies, these instances can only be used for reads, allowing you to scale out read-intensive workloads. If the primary database instance fails, Neptune fails over by promoting one of the read replicas to primary, which gives the cluster a reported availability of 99.99%.
  • Cluster volumes: The data is stored across multiple Availability Zones (AZs) for high availability. All database instances, both the primary and the read replicas, connect to the same cluster volume. If a disk fails in the underlying storage system, it is automatically repaired by Neptune without causing downtime.

Neptune allows you to encrypt the data with AWS Key Management Service (KMS) keys, which are applied not only to the primary cluster volumes but also to snapshots and any other copies of the database.

You connect to your Neptune cluster using one of several endpoints, which are URLs dedicated to a specific purpose:

  • The cluster endpoint is the read/write connection to the primary database instance.
  • The reader endpoint provides round-robin connections to all readable database instances. It cannot be used for writes. The round-robin is implemented by changing the DNS record to resolve to a different IP address, so clients that cache the DNS record will keep connecting to the same replica.
  • Instance endpoints point to a specific instance in the cluster. Normally, you won’t need these endpoints, but in some scenarios, it may make sense to connect directly to a particular replica.
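
As a small illustration of how an application might route traffic between these endpoints, the sketch below sends writes to the cluster endpoint and reads to the reader endpoint. The `ClusterEndpoints` class and the hostnames are hypothetical placeholders. Port 8182 is Neptune's default port.

```python
from dataclasses import dataclass

NEPTUNE_PORT = 8182  # Neptune's default port


@dataclass(frozen=True)
class ClusterEndpoints:
    cluster: str  # read/write endpoint, resolves to the primary instance
    reader: str   # read-only endpoint, rotates across replicas through DNS

    def host_for(self, write: bool) -> str:
        """Writes must go to the cluster endpoint; reads can use the reader endpoint."""
        return self.cluster if write else self.reader

    def base_url(self, write: bool) -> str:
        return f"https://{self.host_for(write)}:{NEPTUNE_PORT}"


# Placeholder hostnames; real values come from the cluster's details page.
endpoints = ClusterEndpoints(
    cluster="my-graph.cluster-example.us-east-1.neptune.amazonaws.com",
    reader="my-graph.cluster-ro-example.us-east-1.neptune.amazonaws.com",
)

print(endpoints.base_url(write=True))
print(endpoints.base_url(write=False))
```

Because the reader endpoint balances through DNS, a long-lived client that resolves the name once will stay pinned to one replica, so opening fresh connections from time to time may spread reads more evenly.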

Neptune allows you to use two popular query languages (although they cannot be intermixed):

  • Apache TinkerPop Gremlin is a popular graph traversal language.
  • SPARQL allows you to query Resource Description Framework (RDF) graphs.
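
To give a feel for the two languages, the sketch below builds (but does not send) one HTTP request for each, assuming Neptune's HTTP endpoints at `/gremlin` and `/sparql` on port 8182. The helper names, the hostname, the `user` label, and the sample queries are hypothetical. A real request would also need network access to the cluster's VPC.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def gremlin_request(host, traversal, port=8182):
    """Build a POST request that carries a Gremlin traversal as JSON."""
    body = json.dumps({"gremlin": traversal}).encode("utf-8")
    return Request(
        f"https://{host}:{port}/gremlin",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def sparql_request(host, query, port=8182):
    """Build a POST request that carries a SPARQL query as form data."""
    body = urlencode({"query": query}).encode("utf-8")
    return Request(
        f"https://{host}:{port}/sparql",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


host = "my-graph.cluster-example.us-east-1.neptune.amazonaws.com"  # placeholder

# Gremlin: walk the property graph, returning up to five 'user' vertices.
gremlin = gremlin_request(host, "g.V().hasLabel('user').limit(5)")

# SPARQL: match up to five subject/predicate/object triples in the RDF graph.
sparql = sparql_request(host, "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5")

print(gremlin.full_url)
print(sparql.full_url)
```

The two requests target different data models (property graph versus RDF triples), which is why the languages cannot be mixed within one query.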

Advocates for graph databases claim that they are a more natural way to represent real-world objects and could eventually rival the popularity of relational databases. It’s worth spending some time to acquaint yourself with Neptune so that you can recognize valid use cases and administer graph databases successfully.
