Amazon announced Redshift this week. Actually, they announced general availability; the service itself was first announced late last year.
Redshift is the new service that leverages the Amazon AWS infrastructure so that you can deploy a data warehouse. I’m not yet convinced that I would want my production data warehouse on AWS, but I can really see the use in a dev and test environment, especially for integration testing.
According to Amazon: “Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. It is optimized for datasets ranging from a few hundred gigabytes to a petabyte or more and costs less than $1,000 per terabyte per year, a tenth the cost of most traditional data warehousing solutions.”
A terabyte warehouse for less than $1,000 per year. That is fantastic. For one financial services firm where I created a 16TB warehouse, the price for hardware and database licensing was several million dollars. That was just startup costs. Renewing licenses each year ran into the tens of thousands of dollars.
Redshift offers optimized query and I/O performance for large workloads. It uses columnar storage, compression, and parallelization to allow the service to scale to petabyte sizes.
I think one of the more interesting specs is that it can use the standard Postgres drivers. I don’t see anywhere, yet, where they say specifically that this was built on Postgres, but I am inferring that.
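If the standard Postgres drivers really do work, then connecting should look just like connecting to any Postgres database. Here’s a minimal sketch building a libpq-style connection string; the cluster endpoint and credentials are hypothetical placeholders, and Redshift’s documented default port of 5439 is assumed:

```python
# Sketch: because Redshift speaks the PostgreSQL wire protocol, any
# libpq-based driver (psycopg2, the Postgres JDBC driver, psql) should
# be able to connect. Endpoint and credentials below are hypothetical.
def redshift_dsn(host, dbname, user, password, port=5439):
    """Build a libpq connection string; Redshift listens on 5439 by default."""
    return (f"host={host} port={port} dbname={dbname} "
            f"user={user} password={password}")

dsn = redshift_dsn(
    "examplecluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    "dev", "admin", "secret")
# With psycopg2 installed: conn = psycopg2.connect(dsn)
print(dsn)
```

The same endpoint and port should work from psql or any JDBC/ODBC tool you already use against Postgres.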
Pricing starts at $0.85 per hour, but with reserved pricing you can get that down to an effective $0.228 per hour. That works out to under $1,000 per terabyte per year. You just can’t compete with this on price in your own data center.
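The arithmetic behind that per-terabyte figure is worth spelling out. Assuming the $0.228/hour reserved rate applies to a single 2TB XL node running around the clock:

```python
# Sanity check of the reserved-pricing claim. Rates are from the
# announcement; a year is approximated as 24 * 365 hours.
hourly = 0.228   # effective reserved rate per XL node, $/hour
node_tb = 2      # a single XL node holds 2 TB

per_node_year = hourly * 24 * 365
per_tb_year = per_node_year / node_tb
print(round(per_node_year, 2))  # ~1997.28 per node per year
print(round(per_tb_year, 2))    # ~998.64 per TB per year
```

So the "less than $1,000 per terabyte per year" claim holds, just barely, at the reserved rate.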
If you want to scale to a petabyte, you need to have a petabyte of capacity in place. In your own data center, that is going to cost you a fortune. Once again, AWS takes the first step in moving an entire architecture into the cloud. Is anyone else offering anything close to this? I guess Oracle’s cloud offering is the closest, but, as far as I know, they are not promoting warehouse-sized instances yet.
Did I say it’s scalable?
Scalable – With a few clicks of the AWS Management Console or a simple API call, you can easily scale the number of nodes in your data warehouse up or down as your performance or capacity needs change. Amazon Redshift enables you to start with as little as a single 2TB XL node and scale up all the way to a hundred 16TB 8XL nodes for 1.6PB of compressed user data. Amazon Redshift will place your existing cluster into read-only mode, provision a new cluster of your chosen size, and then copy data from your old cluster to your new one in parallel. You can continue running queries against your old cluster while the new one is being provisioned. Once your data has been copied to your new cluster, Amazon Redshift will automatically redirect queries to your new cluster and remove the old cluster.
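That “simple API call” for resizing can be sketched as follows. This is a hedged example, not a verbatim recipe: the parameter names follow the Redshift ModifyCluster API, the client call assumes a boto-style AWS SDK, and the cluster identifier and node counts are hypothetical.

```python
# Sketch of a resize request. Changing NodeType/NumberOfNodes is what
# triggers the read-only/copy/redirect resize flow described above.
# All identifiers here are hypothetical.
def resize_request(cluster_id, node_type, number_of_nodes):
    """Build the parameters for a Redshift ModifyCluster-style resize."""
    return {
        "ClusterIdentifier": cluster_id,
        "NodeType": node_type,
        "NumberOfNodes": number_of_nodes,
    }

params = resize_request("my-warehouse", "dw.hs1.8xlarge", 4)
# With an AWS SDK, e.g. boto3:
#   boto3.client("redshift").modify_cluster(**params)
print(params)
```

The key point from the quoted description is that this is non-destructive: the old cluster keeps serving (read-only) queries while the new one is provisioned and loaded.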
Redshift is SQL based, so you can access it with your normal tools. It is fully managed, so backups and other admin concerns are automatic and automated. I’m not sure yet what tools you would use to design your database schemas; since the database uses a columnar store, and your data is replicated across multiple nodes, an existing modeling tool would need to be aware of both to build the tables well.
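Even without dedicated modeling tools, schema design appears to be plain SQL DDL with Redshift-specific hints for how data is distributed and sorted across nodes. A sketch, with a hypothetical table; DISTKEY and SORTKEY are the documented Redshift extensions:

```python
# Sketch: Redshift table DDL. DISTKEY controls which node a row lands
# on; SORTKEY controls on-disk sort order. Table/columns are hypothetical.
ddl = """
CREATE TABLE sales (
    sale_id    BIGINT        NOT NULL,
    sale_date  DATE          NOT NULL,
    store_id   INTEGER       NOT NULL,
    amount     DECIMAL(12,2)
)
DISTKEY (store_id)   -- distribute rows across nodes by store
SORTKEY (sale_date); -- sort on disk by date, which helps range scans
"""
# With a live connection: cursor.execute(ddl)
print(ddl)
```

So the replication and distribution concerns the tooling would need to understand show up as a couple of extra clauses, rather than a whole new DDL dialect.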
You can also use Amazon RDS, Elastic MapReduce, or DynamoDB to source data, and you can pull data directly from S3. All in all, I’m pretty excited to see this offering. I hope I get a client who wants to take a shot at it. I like working on AWS anyway, but I would love to work on a Redshift gig.
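Pulling data from S3 uses Redshift’s COPY command, which loads files in parallel across the cluster. A hedged sketch; the bucket, path, and credential values are placeholders, and the DELIMITER/GZIP options are assumptions about the file format:

```python
# Sketch: bulk load from S3 via COPY. Bucket, path, and credentials
# are placeholders; delimiter and compression are assumed.
copy_sql = """
COPY sales
FROM 's3://example-bucket/sales/2013/'
CREDENTIALS 'aws_access_key_id=XXX;aws_secret_access_key=YYY'
DELIMITER '|'
GZIP;
"""
# With a live connection: cursor.execute(copy_sql)
print(copy_sql)
```

Since COPY runs inside the cluster, the load parallelizes across nodes rather than funneling through a single client connection.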