Live import
You can import data into a running Dgraph instance (which may already contain data) using the Dgraph CLI command dgraph live, referred to as Live Loader. Live Loader sends mutations to a Dgraph cluster and has options to handle the assignment of unique IDs and to update existing data.
Before you begin
Verify that you have a local folder <local-path-to-data> containing:
- at least one data file in RDF or JSON format, plain or gzipped, with the data to import
- an optional schema file.
Those files are generated by an export or by a data migration tool.
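As an illustration, a minimal dataset could look like the following; the file names, predicates, and index choices are examples only, not a required layout:

## example.rdf
<_:alice> <name> "Alice" .
<_:alice> <age> "29" .
<_:bob> <name> "Bob" .
<_:alice> <knows> <_:bob> .

## example.schema
name: string @index(exact) .
age: int .
knows: [uid] .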
Importing data on Dgraph Cloud
- Obtain the dgraph binary or the latest Docker image by following the installation instructions. This is required to run the Dgraph CLI command dgraph live.
- Obtain the gRPC endpoint of your Dgraph Cloud backend and a valid Client API key. An administrator gets this information with the following steps:
  - Log into the Dgraph Cloud account and select the backend.
  - In the Admin section of the Dgraph Cloud console, go to Settings and copy the value of the gRPC Endpoint from the General tab.
  - Access the API Keys tab to generate a Client API Key.

The gRPC endpoint is different from the GraphQL endpoint that you can find in the Overview section. The gRPC endpoint looks like frozen-mango.grpc.us-west-1.aws.cloud.dgraph.io:443.
- Run the live loader as follows:
Using Docker:

docker run -it --rm -v <local-path-to-data>:/tmp dgraph/dgraph:latest \
  dgraph live --slash_grpc_endpoint <grpc-endpoint> -f /tmp/<data-file> -s /tmp/<schema-file> -t <api-key>

Load multiple data files by pointing -f at the mounted directory:

docker run -it --rm -v <local-path-to-data>:/tmp dgraph/dgraph:latest \
  dgraph live --slash_grpc_endpoint <grpc-endpoint> -f /tmp -s /tmp/<schema-file> -t <api-key>

Using the dgraph binary:

dgraph live --slash_grpc_endpoint <grpc-endpoint> -f <local-path-to-data>/<data-file> -s <local-path-to-data>/<schema-file> -t <api-key>

Load multiple data files by pointing -f at the directory:

dgraph live --slash_grpc_endpoint <grpc-endpoint> -f <local-path-to-data> -s <local-path-to-data>/<schema-file> -t <api-key>

When the path provided with the -f, --files option is a directory, all files ending in .rdf, .rdf.gz, .json, and .json.gz are loaded. Make sure that your schema file has a different extension (.txt or .schema, for example).
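For instance, a filled-in invocation could look like the following; the endpoint, mounted path, and file names are placeholders to replace with your own values:

docker run -it --rm -v ~/dgraph-export:/tmp dgraph/dgraph:latest \
  dgraph live --slash_grpc_endpoint frozen-mango.grpc.us-west-1.aws.cloud.dgraph.io:443 \
  -f /tmp/g01.rdf.gz -s /tmp/g01.schema.gz -t <api-key>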
Batch upserts
You can use Live Loader to update existing data, either to modify existing predicates or to add new predicates to existing nodes. To do so, use the -U, --upsertPredicate flag or the -x, --xidmap flag.
upsertPredicate flag
Use the -U, --upsertPredicate flag to specify the predicate name in your data that serves as the unique identifier.
For example:
dgraph live --files <directory-with-data-files> --schema <path-to-schema-file> --upsertPredicate xid
The upsert predicate used must be present in the Dgraph instance or in the schema file, and it must be indexed.
For each node, Live loader will use the node name provided in the data file as the upsert predicate value.
For example, if your data file contains

<_:my.org/customer/1> <firstName> "John" .

the previous command creates or updates the node whose predicate xid equals my.org/customer/1 and sets its predicate firstName to the value John.
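Because the upsert predicate must be indexed, the instance or the schema file needs a matching entry. A minimal sketch, assuming xid is the upsert predicate (the index tokenizer may differ in your setup):

xid: string @index(exact) .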
xidmap flag
dgraph live --files <directory-with-data-files> --schema <path-to-schema-file> --xidmap <local-directory>
Live loader uses the -x, --xidmap directory to look up the uid value for each node name used in the data file, and to store the mapping between node names and the generated uid for every new node.
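For example, reusing the same xidmap directory across runs lets a later load update the nodes created by an earlier one; the directory names below are illustrative:

dgraph live --files <directory-with-initial-data> --schema <path-to-schema-file> --xidmap ./xidmap
dgraph live --files <directory-with-updated-data> --xidmap ./xidmap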
Import data on Dgraph self-hosted
Run the live loader using the -a, --alpha flag, as follows.
Using Docker:

docker run -it --rm -v <local-path-to-data>:/tmp dgraph/dgraph:latest \
  dgraph live --alpha <Dgraph Alpha gRPC endpoint> -f /tmp/<data-file> -s /tmp/<schema-file>

Load multiple data files by pointing -f at the mounted directory:

docker run -it --rm -v <local-path-to-data>:/tmp dgraph/dgraph:latest \
  dgraph live --alpha <Dgraph Alpha gRPC endpoint> -f /tmp -s /tmp/<schema-file>

Using the dgraph binary:

dgraph live --alpha <grpc-endpoints> -f <local-path-to-data>/<data-file> -s <local-path-to-data>/<schema-file>

The --alpha default value is localhost:9080. You can specify a comma-separated list of Alpha addresses in the same cluster to distribute the load.

When the path provided with the -f, --files option is a directory, all files ending in .rdf, .rdf.gz, .json, and .json.gz are loaded. Make sure that your schema file has a different extension (.txt or .schema, for example).
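For example, to spread the load across several Alpha nodes of the same cluster (the host names are placeholders):

dgraph live --alpha alpha1:9080,alpha2:9080,alpha3:9080 \
  -f <local-path-to-data> -s <local-path-to-data>/<schema-file>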
Load from S3
To live load from Amazon S3 (Simple Storage Service), you must either have permissions to access the S3 bucket from the system performing the live load (see IAM setup below) or explicitly set the following AWS credentials via environment variables:
| Environment Variable | Description |
|---|---|
| AWS_ACCESS_KEY_ID or AWS_ACCESS_KEY | AWS access key with permissions to access the S3 bucket. |
| AWS_SECRET_ACCESS_KEY or AWS_SECRET_KEY | AWS secret key with permissions to access the S3 bucket. |
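For example, the credentials can be exported in the shell that runs Live Loader; the values shown are placeholders:

export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>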
IAM setup
In AWS, you can accomplish this by doing the following:
- Create an IAM Role with an IAM Policy that grants access to the S3 bucket.
- Depending on whether you want to grant access to an EC2 instance or to a pod running on EKS, choose one of the following options:
- Instance Profile can pass the IAM Role to an EC2 Instance
- IAM Roles for Amazon EC2 to attach the IAM Role to a running EC2 Instance
- IAM roles for service accounts to associate the IAM Role to a Kubernetes Service Account.
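An IAM Policy granting read access to the bucket might look like the following sketch; the bucket name is a placeholder and the statement should be tightened to your own security requirements:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::<bucket-name>",
        "arn:aws:s3:::<bucket-name>/*"
      ]
    }
  ]
}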
Once your setup is ready, you can execute the live load from S3. As examples:
## short form of S3 URL
dgraph live \
--files s3:///<bucket-name>/<directory-with-data-files> \
--schema s3:///<bucket-name>/<directory-with-data-files>/schema.txt
## long form of S3 URL
dgraph live \
--files s3://s3.<region>.amazonaws.com/<bucket>/<directory-with-data-files> \
--schema s3://s3.<region>.amazonaws.com/<bucket>/<directory-with-data-files>/schema.txt
Note that the short form of the S3 URL requires a triple slash (s3:///), while the long form requires only a double slash (s3://).
Load from MinIO
To live load from MinIO, you must have the following MinIO credentials set via environment variables:
| Environment Variable | Description |
|---|---|
| MINIO_ACCESS_KEY | MinIO access key with permissions to access the bucket. |
| MINIO_SECRET_KEY | MinIO secret key with permissions to access the bucket. |
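For example (placeholder values):

export MINIO_ACCESS_KEY=<your-minio-access-key>
export MINIO_SECRET_KEY=<your-minio-secret-key>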
Once your setup is ready, you can execute the live load from MinIO:
dgraph live \
--files minio://minio-server:port/<bucket-name>/<directory-with-data-files> \
--schema minio://minio-server:port/<bucket-name>/<directory-with-data-files>/schema.txt
Enterprise Features
Multi-tenancy (Enterprise Feature)
Since multi-tenancy requires ACL, when using the Live loader you must provide the login credentials using the --creds flag. By default, Live loader loads the data into the user's namespace.

Guardians of the Galaxy can load the data into multiple namespaces. Using --force-namespace, a Guardian can load the data into the namespace specified in the data and schema files.

The namespaces used in the data and schema files must exist before loading the data.
For example, to preserve the namespace while loading data, first create the namespace(s) and then run the live loader command:
dgraph live \
--schema /tmp/data/1million.schema \
--files /tmp/data/1million.rdf.gz --creds="user=groot;password=password;namespace=0" \
--force-namespace -1
A Guardian of the Galaxy can also load data into a specific namespace. For example, to force the data loading into namespace 123:
dgraph live \
--schema /tmp/data/1million.schema \
--files /tmp/data/1million.rdf.gz \
--creds="user=groot;password=password;namespace=0" \
--force-namespace 123
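The target namespaces must be created ahead of time by a Guardian of the Galaxy. As a sketch, assuming the addNamespace mutation on the /admin GraphQL endpoint while logged in as groot on namespace 0 (the password is a placeholder):

mutation {
  addNamespace(input: { password: "<new-namespace-password>" }) {
    namespaceId
    message
  }
}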
Encrypted imports (Enterprise Feature)
A new flag --encryption key-file=value
is added to the Live Loader. This option is required to decrypt the encrypted export data and schema files. Once the export files are decrypted, the Live Loader streams the data to a live Alpha instance.
Alternatively, starting with v20.07.0, the vault_*
options can be used to decrypt the encrypted export and schema files.
If the live Alpha instance has encryption at rest enabled, the p directory will be encrypted. Otherwise, the p directory is unencrypted.
For example, to load an encrypted RDF/JSON file and schema via Live Loader:
dgraph live \
--files <path-containing-encrypted-data-files> \
--schema <path-to-encrypted-schema> \
--encryption key-file=<path-to-keyfile-to-decrypt-files>
You can import your encrypted data into a new Dgraph Alpha node without encryption enabled.
# Encryption Key from the file path
dgraph live --files "<path-to-gzipped-RDF-or-JSON-file>" --schema "<path-to-schema>" \
--alpha "<dgraph-alpha-address:grpc_port>" --zero "<dgraph-zero-address:grpc_port>" \
--encryption key-file="<path-to-enc_key_file>"
# Encryption Key from HashiCorp Vault
dgraph live --files "<path-to-gzipped-RDF-or-JSON-file>" --schema "<path-to-schema>" \
--alpha "<dgraph-alpha-address:grpc_port>" --zero "<dgraph-zero-address:grpc_port>" \
--vault addr="http://localhost:8200";enc-field="enc_key";enc-format="raw";path="secret/data/dgraph/alpha";role-id-file="./role_id";secret-id-file="./secret_id"
Other Live Loader options
- --new_uids (default: false): Assign new UIDs instead of using the existing UIDs in data files. This is useful to avoid overriding the data in a DB already in operation.
- --format: Specify the file format (rdf or json) instead of getting it from filenames. This is useful if you need to define a strict format manually.
- -b, --batch (default: 1000): Number of N-Quads to send as part of a mutation.
- -c, --conc (default: 10): Number of concurrent requests to make to Dgraph. Do not confuse with -C.
- -C, --use_compression (default: false): Enable compression for connections to and from the Alpha server.
- --vault: This superflag's options specify the Vault server address, role id, secret id, and the field that contains the encryption key required to decrypt the encrypted export.
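As an illustration, several of these options can be combined in one invocation; the values below are placeholders that should be tuned to your data and hardware:

dgraph live --alpha localhost:9080 \
  --files <local-path-to-data> --schema <local-path-to-data>/<schema-file> \
  --format rdf --batch 2500 --conc 20 --use_compression --new_uids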