Redshift

Redshift is a powerful data warehouse from AWS. It enables you to store and query all your customer data at once so you can get a full picture of the customer journey.

Getting Started

Redshift can be challenging to work with because loading events individually as they arrive is not performant. To maximize performance, the best practice is to build a data pipeline that processes and loads data in batches, which is what our system does.

Our system will dynamically create schemas, tables, and columns as you send data. For example, we will create a new table for every unique bigPicture.track() event you send us and a new column for every unique property.
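
As an illustration (the schema, table, and column names here are hypothetical, not a guaranteed layout), a track event named Order Completed with a revenue property would end up in a table roughly like this:

-- illustrative only: a table our system might create for an "Order Completed" track event
CREATE TABLE your_source.order_completed (
  id VARCHAR(512),
  received_at TIMESTAMP,
  user_id VARCHAR(512),
  revenue DECIMAL(10,2)
);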

For this reason, we highly recommend creating a tracking plan and knowing exactly what data you will send before setting up BigPicture.io. Another option is to selectively choose which integrations receive your data (for example, excluding Redshift) in order to better control what is sent.

Provision your Redshift cluster

If you already have an existing cluster you can skip this section.

1. Navigate to the Redshift dashboard.

2. Click "Launch cluster".

3. Set your cluster details. Be sure to choose a secure password.

4. Select your cluster size. Generally you can start with one node and add more as you need them.

5. Additional configuration.

  • Encryption - You can optionally encrypt your database. It will not impact our ability to load data.
  • Publicly accessible - This needs to be set to "Yes" in order for our system to connect.
  • Security groups - We recommend creating a new security group. See the Networking section below for how to configure it.

Permissioning BigPicture.io

At this point you should either have an existing cluster or be in the final stages of setting one up.

Now you need to configure access for BigPicture.io to connect to your cluster.

Networking

1. Navigate to the VPC dashboard.

2. Click on the "Security Groups" tab, then click on "Create Security Group".

3. Give it a name and description, then create it.

4. Select your group, then at the bottom, click on the "Inbound Rules" tab.

5. Create a rule of type "Redshift" and enter the following source:

34.208.14.169/32

This is the IP address BigPicture.io will connect to your cluster from.

6. Now that your security group is set up, go back to the cluster configuration and select this group.

If you have an existing cluster instead, navigate to the "Modify" section of your cluster and select your new group there. Make sure "Publicly accessible" is set to "Yes".

Database user

The username and password you created in your cluster configuration belong to the admin user. You should not share the admin credentials with any third party, and it's best practice to create a new, dedicated user.

To create a new Redshift user, you'll need to log in to Redshift directly. AWS does not provide or install any SQL client tools, so it recommends using a tool like SQL Workbench/J.

NOTE: When logging in to your cluster, you will have to whitelist your own IP address as well.


Once you're logged in, execute the following commands to create a new database user:


-- create a user named "bigpicture" that our system will use when connecting to your Redshift cluster.
CREATE USER bigpicture PASSWORD 'YOUR_PASSWORD_HERE';

-- allows the "bigpicture" user to create new schemas on the specified database. (this is the name you chose when provisioning your cluster)
GRANT CREATE ON DATABASE "YOUR_DATABASE_NAME_HERE" TO bigpicture;

This user is what you'll provide in the BigPicture.io configuration when setting up Redshift.

Handling an existing database

If you have an existing Redshift database and want BigPicture.io to write to your existing schemas / tables, you'll also have to grant access to the new user you created.

This applies, for example, if you are migrating from another third-party service that loads data into Redshift and you want BigPicture.io to write to the existing schemas rather than create new ones.

First, you'll have to execute this command to grant privileges to any future tables created in the existing schemas:


-- Set permissions on any newly created tables
ALTER DEFAULT PRIVILEGES FOR USER bigpicture GRANT ALL ON TABLES TO bigpicture;

Next, you need to grant access to all of your existing schemas and set the table owner to the bigpicture user you created. The user needs to be the table owner because our system will dynamically create new tables and add new columns as you send data.


-- Grant existing schemas
GRANT ALL ON SCHEMA "YOUR_SCHEMA_NAME_HERE" TO bigpicture;

Now set bigpicture as the owner of each table. This can be a bit tedious if you have a lot of tables; see the sketch after the command below for one way to generate these statements.



-- Make bigpicture user the owner. Do this for each table.
ALTER TABLE YOUR_TABLE_NAME_HERE OWNER TO bigpicture;
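
One way to make this less tedious (a sketch; review the generated statements before running them) is to generate the ALTER TABLE commands from the pg_tables catalog view:

-- generate an ALTER TABLE ... OWNER TO statement for every table in a schema
SELECT 'ALTER TABLE "' || schemaname || '"."' || tablename || '" OWNER TO bigpicture;'
FROM pg_tables
WHERE schemaname = 'YOUR_SCHEMA_NAME_HERE';

Copy the statements this query returns and run them against your database.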

This may be a little complicated for non-technical users, so feel free to email us at support@bigpicture.io if you need help.

Sync schedule

Currently data is synced once an hour. Let us know if you need a custom schedule.

Distribution key

The id column is the common distribution key used across all tables. If you send external data that already has a unique id (for example, from Zendesk), we will use that as the id. Otherwise, we will create a unique hash to prevent duplicate rows from being inserted.
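
If you want to confirm how a table is distributed, you can inspect it with the pg_table_def catalog view (the table name below is a placeholder, and the table's schema must be on your search_path for pg_table_def to return rows):

-- list each column of a table and whether it is the distribution key
SELECT "column", type, distkey
FROM pg_table_def
WHERE tablename = 'YOUR_TABLE_NAME_HERE';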

Reserved words

Redshift limits the use of reserved words in schema, table, and column names. We mitigate this by wrapping all names in double quotes within queries, but you should still be careful to avoid conflicts.
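
For example, if you send a property named order, which collides with a Redshift reserved word, queries reference it as a quoted identifier (the schema and table names here are illustrative):

-- "order" must be double-quoted because ORDER is a reserved word
SELECT "order", received_at
FROM your_source.order_completed;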

In addition to Redshift reserved words, you should avoid sending data that conflicts with BigPicture.io top level fields (id, userId, messageId, receivedAt, etc.). If your traits or properties conflict with BigPicture.io fields, we will prepend a _ to the name.

Lastly, Redshift does not allow identifiers to start with a number, so we will also prepend a _ to those names. For example, if you send a property named 5guys, we will modify it to _5guys before loading it into Redshift.

Technical details

Data loading

Our system first batches data on S3 and then uses the COPY command to load data as quickly as possible.
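
As a rough sketch of what a COPY-based load looks like (the table, bucket, role, and format options below are placeholders, not the actual values our pipeline uses):

-- load a gzipped JSON batch from S3 into a table
COPY your_source.order_completed
FROM 's3://your-bucket/batches/batch-0001.json.gz'
IAM_ROLE 'arn:aws:iam::123456789012:role/YourRedshiftCopyRole'
FORMAT AS JSON 'auto'
GZIP;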

If a table or column doesn't exist, we will lock the table and update it.
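
For example, adding a newly discovered column comes down to an ALTER TABLE statement roughly like this (names are illustrative; the locking itself is handled by our pipeline):

-- add a newly discovered property as a new column before loading the batch
ALTER TABLE your_source.order_completed ADD COLUMN coupon_code VARCHAR(512);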

When we process a batch and discover a new column, we will infer the column's datatype from the first value we see.

Handling existing columns

When we process a batch and find an existing column datatype that doesn't match the datatype you're sending, we will first try to convert the incoming data.

For example, you previously sent price: 99, where 99 is a number. Then later on you send price: "99", where "99" is a string. Our system will attempt to convert the string to a number.

If the data cannot be converted to the datatype of the existing column, we will drop that data to avoid having the whole batch load fail. For this reason, we highly recommend creating a detailed tracking plan to avoid conflicts and changing datatypes.

Security

Your VPC enables you to block traffic from the internet and selectively whitelist IPs. For this integration, you whitelist the BigPicture.io IPs so that our servers can access your cluster.

Additionally, by creating a unique user to access your cluster, you can isolate queries and audit access.

Column sizing

We set the default string column length limit to 512 characters. Longer strings are truncated. This decision is based on general industry standards.
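
If a particular column needs to hold longer strings, one option (a sketch, with placeholder names) is to widen that column yourself after it has been created:

-- increase the maximum length of an existing VARCHAR column
ALTER TABLE your_source.pages ALTER COLUMN url TYPE VARCHAR(1024);

Note that values truncated in earlier loads are not restored by widening the column; only future loads benefit.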