There are many credit card transaction data providers that sell anonymized transactional data with fields such as customer_id, location, sector, and potentially merchant_id. For example, uncleaned data may include fields such as:

description, transaction_date, post_date, amount, transaction_status, transaction_id, merchant_id, category_code, transaction_type, running_balance

Fields, data cleanliness, and data access (through a database or flat files on S3) vary widely and depend on the provider. Popular providers include Yodlee and MasterCard Advisors. We will walk through an example that pulls large comma-separated transaction files from S3 and loads the data into AWS Redshift.

First, set up a Redshift cluster and create a table. There are many options to choose from, such as Dense Compute or Dense Storage nodes; every use case is different, so it is important to read the documentation. Among the many settings, one important choice is which field to use as the sortkey. If you will be querying mostly by date, use the date field. However, if your main use case is joining on merchant_id, that column may be the better choice.

create table credit_card_transaction (
  field_one integer,
  -- ... additional fields omitted ...
  field_two varchar(100),
  field_three varchar(50),
  transaction_date timestamp encode delta32k sortkey
);
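If joins on merchant_id are the dominant use case instead, a minimal sketch of the same table (using the hypothetical placeholder columns above, and assuming a merchant_id column) might make merchant_id both the distkey and the sortkey so joined rows are co-located and pre-sorted:

create table credit_card_transaction (
  field_one integer,
  -- ... additional fields omitted ...
  merchant_id varchar(50) distkey sortkey,
  transaction_date timestamp encode delta32k
);

Distributing on merchant_id keeps each merchant's rows on the same slice, which speeds up joins against a merchant reference table, at the cost of somewhat less efficient date-range scans.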

After the Redshift table is created, use the COPY command to load the data from S3 into Redshift. The Redshift columns need to map exactly to the incoming data; if the incoming data is not clean and fields are missing (e.g. 10 columns instead of 12), the COPY command will fail. The most difficult part is cleaning the data and deciding when a transaction is missing too much information to be useful. Unfortunately, data formatting and presentation vary significantly between providers, so we will not cover that in this article.

copy credit_card_transaction
from 's3://data-endpoint'
credentials 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx'
delimiter ','
removequotes;
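If the COPY command rejects the file, Redshift records the rejected rows and the reason in the STL_LOAD_ERRORS system table. Inspecting the most recent entries usually points to the offending column or line:

select starttime, filename, line_number, colname, err_reason
from stl_load_errors
order by starttime desc
limit 10;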

Once your data is in Redshift, you can query it using simple SQL statements. The most popular data set currently covers roughly 1 million customers and therefore 100+ million transactions per year, which Redshift can easily handle in a single table. If the data set becomes too large, we would suggest partitioning it into separate tables by date, but at this scale that should not be necessary. You can also connect your Redshift cluster to popular BI tools such as Tableau. At Volmanac, we use our Volmanac Data app to run worker processes that issue thousands of queries every time new data is added and save any interesting, statistically significant results.
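As a simple illustration of the kind of query these jobs run (assuming the table includes the merchant_id and amount fields described above), the following returns monthly transaction counts and total spend per merchant:

select merchant_id,
       date_trunc('month', transaction_date) as txn_month,
       count(*) as txn_count,
       sum(amount) as total_spend
from credit_card_transaction
group by 1, 2
order by 1, 2;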

This isn’t that difficult to set up, and having the data in your own Redshift cluster lets an investor perform custom queries and analysis. We aren’t sure why anyone would outsource this to an aggregator such as Earnest Research; if you can’t perform these calculations in-house, you are already behind the more sophisticated investors.

Update: 11/21/2016

To be clear, the best way to use this data is not simply to sum up transaction amounts at different retailers and businesses; that approach introduces a lot of noise and usually produces inconsistent results. A better (and more predictive) way to use the data is to monitor consumer trends, which is why we have our jobs run thousands of queries. One recent example sent to us by a reader involves Hertz. Many funds were short Hertz going into its recent earnings miss, as customers who had historically made car rental transactions were allocating more and more dollars to ride-hailing services such as Uber and Lyft. This trend, along with accelerated car depreciation due to lower resale values, caused Hertz stock to drop by over 20% after earnings were released.
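As a rough sketch of this kind of trend query (the merchant name matching below is purely hypothetical and depends on how your provider populates the description field), one could track the monthly split of spend between a car rental company and ride-hailing services:

select date_trunc('month', transaction_date) as txn_month,
       sum(case when description ilike '%hertz%' then amount else 0 end) as rental_spend,
       sum(case when description ilike '%uber%' or description ilike '%lyft%' then amount else 0 end) as rideshare_spend
from credit_card_transaction
group by 1
order by 1;

Watching how the ratio of these two columns shifts over time is far more informative than either total on its own.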