Posts Tagged ‘database’

Amazon Redshift – Datawarehouse in the Clouds

February 16th, 2013 Comments off

Amazon announced Redshift this week. Actually, they announced the general availability; they announced late last year that it was coming.

Redshift is the new service that leverages the Amazon AWS infrastructure so that you can deploy a data warehouse. I’m not yet convinced that I would want my production data warehouse on AWS, but I can really see the use in a dev and test environment, especially for integration testing.

According to Amazon: Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. It is optimized for datasets ranging from a few hundred gigabytes to a petabyte or more and costs less than $1,000 per terabyte per year, a tenth the cost of most traditional data warehousing solutions.

A terabyte warehouse for less than $1,000 per year. That is fantastic. For one financial services firm where I created a 16TB warehouse, the price for hardware and database licensing was several million dollars. That was just startup costs. Renewing licenses each year ran into the tens of thousands of dollars.

Redshift offers optimized query and IO performance for large workloads. They provide columnar storage, compression and parallelization to allow the service to scale to petabyte sizes.

I think one of the interesting specs is that it can use the standard Postgres drivers. I don’t see anywhere, yet, where they say specifically that this was built on Postgres, but I am inferring that.
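If the standard Postgres drivers really do work, connecting from Python might look something like the sketch below. Treat all of it as assumption: the host name, database name and credentials are made up, port 5439 is assumed to be the cluster's listening port, and psycopg2 is just one common Postgres driver (imported lazily, so the connection-string helper runs without it).

```python
def redshift_dsn(host, dbname, user, password, port=5439):
    """Build a libpq-style connection string that a Postgres driver accepts."""
    return (
        f"host={host} port={port} dbname={dbname} "
        f"user={user} password={password}"
    )


def run_query(dsn, sql):
    """Run one query through psycopg2 and return all rows."""
    import psycopg2  # third-party driver; only needed when actually connecting

    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    finally:
        conn.close()


# Hypothetical endpoint and credentials, for illustration only.
dsn = redshift_dsn("example-cluster.example.com", "dev", "admin", "secret")
print(dsn)
```

Nothing here is specific to Redshift beyond the assumed port, which is the point: existing Postgres tooling should be able to talk to it.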

Pricing starts at $0.85 per hour but with reserved pricing, you can get that down to $0.228 per hour. On a 2TB node, that brings it down to sub-$1000 per terabyte per year. You just can’t compete with this on price in your own data center.
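The arithmetic behind that claim is easy to check. This sketch assumes the hourly rates apply to a single 2TB XL node running all year (8,760 hours) and that the full 2TB is usable.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def yearly_cost_per_tb(hourly_rate, node_tb=2):
    """Cost per terabyte per year for one node that runs around the clock."""
    return hourly_rate * HOURS_PER_YEAR / node_tb


on_demand = yearly_cost_per_tb(0.85)   # on-demand hourly rate from the post
reserved = yearly_cost_per_tb(0.228)   # reserved hourly rate from the post

print(f"on demand: ${on_demand:,.2f} per TB per year")
print(f"reserved:  ${reserved:,.2f} per TB per year")
```

Under those assumptions the reserved rate lands just under $1,000 per terabyte per year, while on-demand pricing is well above it.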

If you want to scale to a petabyte, you need to have a petabyte in place. In your data center, that is going to cost you a fortune. Once again, AWS takes the first step into moving an entire architecture into the cloud. Is anyone else offering anything close to this? I guess Oracle’s cloud offering is the closest, but, as far as I know, they are not promoting warehouse size instances yet.

Did I say it’s scalable?

Scalable – With a few clicks of the AWS Management Console or a simple API call, you can easily scale the number of nodes in your data warehouse up or down as your performance or capacity needs change. Amazon Redshift enables you to start with as little as a single 2TB XL node and scale up all the way to a hundred 16TB 8XL nodes for 1.6PB of compressed user data. Amazon Redshift will place your existing cluster into read-only mode, provision a new cluster of your chosen size, and then copy data from your old cluster to your new one in parallel. You can continue running queries against your old cluster while the new one is being provisioned. Once your data has been copied to your new cluster, Amazon Redshift will automatically redirect queries to your new cluster and remove the old cluster.

Redshift is SQL based so you can access it with your normal tools. It is fully managed so backups and other admin concerns are automatic and automated. I’m not sure what tools you can use to design your database schemas. Since the database supports columnar data stores, I’m not sure what tools will build the tables. Your data is replicated around multiple nodes so your tool would need to be aware of that also.

You can also use Amazon RDS, Elastic MapReduce or DynamoDB to source data. You can also pull data directly from S3. All in all, I’m pretty excited to see this offering. I hope I get a client who wants to take a shot at this. I like working on AWS anyway but I would love to work on a Redshift gig.




MySQL in Spaaaaaace – Amazon Relational Database Service (RDS)

October 27th, 2009 Comments off

Yep, looks like Amazon finally clued in to the fact that SimpleDB is pretty much useless for any mission critical work. They’ve added a new web service, Relational Database Service, abbreviated RDS.

Amazon Relational Database Service (Amazon RDS) is a web service that makes it easy to set up, operate, and scale a relational database in the cloud. It provides cost-efficient and resizable capacity while managing time-consuming database administration tasks, freeing you up to focus on your applications and business.

Amazon RDS gives you access to the full capabilities of a familiar MySQL database. This means the code, applications, and tools you already use today with your existing MySQL databases work seamlessly with Amazon RDS. Amazon RDS automatically patches the database software and backs up your database, storing the backups for a user-defined retention period. You also benefit from the flexibility of being able to scale the compute resources or storage capacity associated with your relational database instance via a single API call. As with all Amazon Web Services, there are no up-front investments required, and you pay only for the resources you use.

This is pretty slick. I haven’t played with it yet as it was just announced but it seems to be an API driven MySQL instance. For slightly more than a base instance, $0.11/hour RDS vs $0.10/hour base EC2 (this price is dropping 15% BTW) on a small server, you get a complete server with MySQL installed. You can create and manage your database instances via procedural call (the API) and you can scale to larger instances or additional storage fairly painlessly by also using those APIs. You also pay extra for your storage of course.
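To put the hourly difference in monthly terms, here is a quick back-of-the-envelope calculation. It assumes a 30-day month of continuous running, reads the 15% drop as applying to the base EC2 rate, and leaves out storage charges entirely.

```python
HOURS_PER_MONTH = 24 * 30  # assumed 30-day month, instance always on

RDS_SMALL = 0.11                            # $/hour, from the post
EC2_SMALL = 0.10                            # $/hour, from the post
EC2_SMALL_REDUCED = EC2_SMALL * (1 - 0.15)  # after the announced 15% drop


def monthly(rate):
    """Monthly compute cost in dollars for an always-on instance."""
    return round(rate * HOURS_PER_MONTH, 2)


rds_month = monthly(RDS_SMALL)
ec2_month = monthly(EC2_SMALL)
ec2_reduced_month = monthly(EC2_SMALL_REDUCED)

print(f"RDS small:           ${rds_month:.2f}/month")
print(f"EC2 small:           ${ec2_month:.2f}/month")
print(f"EC2 small (reduced): ${ec2_reduced_month:.2f}/month")
```

On these numbers the managed MySQL premium is a few dollars a month against the current EC2 price, and somewhat more once the reduction kicks in.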

That’s about it from what I’ve read. I don’t see any automated replication (beyond the normal AWS safety features) and I don’t see any kind of clustering or sharding. This is not what most people would call a cloud database. It’s just an easy to configure, maintain and grow MySQL server. Not that that’s bad. For a small business with some technical savvy but not a lot of time, this is an awesome addition to AWS. I would be willing to bet that some kind of clustering will come, sooner or later.

Ooops, just stumbled across:

Coming Soon: High Availability Offering — For developers and business who want additional resilience beyond the automated backups provided by Amazon RDS at no additional charge. With the high availability offer, developers and business can easily and cost-effectively provision synchronously replicated DB Instances in multiple availability zones (AZ’s), to protect against failure within a single location.

One of the things I have always liked about AWS is that they really do make it simple. For the use cases where SimpleDB is appropriate, using it is a no-brainer, as is EC2 and S3. AWS even makes queuing simple. RDS keeps to that methodology.

Amazon RDS allows you to use a simple set of web services APIs to create, delete and modify relational database instances (DB Instances). You can also use the APIs to control access and security for your instance(s) and manage your database backups and snapshots. For a full list of the available Amazon RDS APIs, please see the Amazon RDS API Guide. Some of the most commonly used APIs and their functionality are listed below:

CreateDBInstance — Provision a new DB Instance, specifying DB Instance class, storage capacity and the backup retention policy you wish to use. This one API call is all that’s needed to give you access to a running MySQL database, with the software pre-installed and the available resource capacity you request.

ModifyDBInstance — Modify settings for a running DB Instance. This lets you use a single API call to scale the resources available to your DB Instance in response to the load on your database, or change how it is automatically backed up and maintained on your behalf.

DeleteDBInstance — Delete a running DB Instance. With Amazon RDS, you can terminate your DB Instance at any time and pay only for the resources you used.

CreateDBSnapshot — Generate a snapshot of your DB Instance. You can restore your DB Instance to these user-created snapshots at any point, even to reinstate a previously deleted DB Instance.

RestoreDBInstanceToPointInTime — Create a new DB Instance from a point-in-time backup. You can restore to any point within the retention period you specified, usually up to the last five minutes of your database’s usage.
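To give a feel for how little a CreateDBInstance call needs, here is a minimal sketch that assembles the request parameters as a plain dictionary. The parameter names follow the RDS API; the identifier, instance class, credentials and the "mysql" engine string are illustrative values, and the optional send step assumes the boto3 SDK is installed and configured, which is an assumption and not something the announcement covers.

```python
def create_db_instance_params(identifier, instance_class, storage_gb,
                              username, password, backup_retention_days=1):
    """Assemble CreateDBInstance parameters as a plain dictionary."""
    return {
        "DBInstanceIdentifier": identifier,
        "DBInstanceClass": instance_class,
        "Engine": "mysql",
        "AllocatedStorage": storage_gb,          # in GB
        "MasterUsername": username,
        "MasterUserPassword": password,
        "BackupRetentionPeriod": backup_retention_days,  # in days
    }


def send_create(params):
    """Optionally send the request with boto3 (not called in this sketch)."""
    import boto3  # third-party SDK; needs AWS credentials to be configured

    return boto3.client("rds").create_db_instance(**params)


# Hypothetical values, for illustration only.
params = create_db_instance_params(
    "blog-db", "db.m1.small", 10, "admin", "change-me", backup_retention_days=7
)
print(sorted(params))
```

Scaling up later would be a ModifyDBInstance call carrying the same identifier and a new instance class or storage size.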

This is a very cool addition to AWS. I am looking forward to playing with it. It’s important to note that if you are capable of administering your own server and database, you can save money by running a base EC2 instance and DIY. If you want to run any database other than MySQL, you have to do that anyway.


Amazon Web Services – SimpleDB Overview

April 22nd, 2009 1 comment


SimpleDB was Amazon’s first available (in beta) web service. It is a neat feature but it has its downsides. First, SimpleDB is not a relational database. It is a name/value key pair. For simple lookups, it is highly reliable and scalable. For anything more complicated, it falls apart.

SimpleDB is not ACID compliant and has no referential integrity. For that matter, it has no schemas, tables or relationships. Amazon says that it “eliminates the administrative burden of data modeling”. Some things make me say, “Hmmmmm.”

SimpleDB structures data somewhat like a spreadsheet. Think of columns across and values down. A particular column can have multiple values. I provide an example of SimpleDB data in Chapter 6.

Like everything else in AWS, SimpleDB is API based. There is no SQL access here. The APIs are very simple to use: CREATE creates a new domain (worksheet); you can GET, PUT and DELETE items (rows), their attributes (columns) and values (data); and you can QUERY data or QUERYWITHATTRIBUTES (which returns the matching items along with their attributes).

Amazon does have a query language but it is strictly string based. You enter a key value (a key being the name of one of your key/value pairs) and then list possible values. There are simple operators that you can use.
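The data model is easier to see in code than in prose. The sketch below is an in-memory stand-in, not the SimpleDB API: the class and method names are invented for illustration. In SimpleDB's own terms, a domain holds items, each item has attributes, and an attribute can carry several string values.

```python
from collections import defaultdict


class Domain:
    """Toy model of a SimpleDB domain: item -> attribute -> set of string values."""

    def __init__(self, name):
        self.name = name
        self.items = defaultdict(lambda: defaultdict(set))

    def put(self, item, attribute, value):
        # Everything is stored as a string; an attribute may hold many values.
        self.items[item][attribute].add(str(value))

    def get(self, item):
        return {a: sorted(v) for a, v in self.items.get(item, {}).items()}

    def delete(self, item, attribute=None):
        if attribute is None:
            self.items.pop(item, None)
        elif item in self.items:
            self.items[item].pop(attribute, None)

    def query(self, attribute, *values):
        # Name one attribute, list the acceptable values, get item names back.
        wanted = set(values)
        return sorted(
            name for name, attrs in self.items.items()
            if attrs.get(attribute, set()) & wanted
        )


books = Domain("books")
books.put("item1", "author", "Smith")
books.put("item1", "author", "Jones")  # second value for the same attribute
books.put("item1", "year", 2008)
books.put("item2", "author", "Jones")

print(books.get("item1"))
print(books.query("author", "Jones"))
```

Note what is missing: no types, no joins and no constraints, which is exactly why anything beyond simple lookups gets awkward.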

SimpleDB is designed to store small volumes of data and is optimized for that. Amazon recommends that large files be stored in S3 and the pointer to those files stored in SimpleDB.


You pay for three things with SimpleDB: machine usage (executing queries), data transfer and persistent storage.

Machine usage is based on the requests made and the amount of time it takes to satisfy those requests. The CPU is based on the same criteria as an EC2 compute unit. It costs $0.14 per machine hour utilized. You start with 25 machine hours for free and start paying at the 26th hour.

Persistent storage was $1.50 per GB until Dec 2008. That was much more expensive than S3 or EBS. In late 2008, Amazon lowered the costs to a more reasonable $0.25 per GB. That is a significant change.

Data transfer is comparable to the other services: Data transfer in is $0.10 per GB, first 10TB out is $0.17, $0.13 for the next 40TB, $0.11 for the next 100TB and $0.10 for all data over 150TB.
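Tiered pricing like this is easier to reason about with a small calculator. The sketch below uses the rates quoted above and assumes that the transfer rates are per GB, that 1TB is 1,024GB, and that machine hours beyond the 25 free ones are billed at the flat rate; the function names are my own.

```python
GB_PER_TB = 1024  # assumption: binary terabytes

# (tier size in GB, price per GB) for data transferred out, in order.
TRANSFER_OUT_TIERS = [
    (10 * GB_PER_TB, 0.17),
    (40 * GB_PER_TB, 0.13),
    (100 * GB_PER_TB, 0.11),
    (float("inf"), 0.10),
]


def transfer_out_cost(gb):
    """Walk the tiers, charging each slice of traffic at its tier's rate."""
    cost, remaining = 0.0, gb
    for tier_size, rate in TRANSFER_OUT_TIERS:
        used = min(remaining, tier_size)
        cost += used * rate
        remaining -= used
        if remaining <= 0:
            break
    return cost


def machine_cost(hours, free_hours=25, rate=0.14):
    """Machine hours are free up to free_hours, then billed per hour."""
    return max(0, hours - free_hours) * rate


print(f"11TB out:   ${transfer_out_cost(11 * GB_PER_TB):,.2f}")
print(f"30 machine hours: ${machine_cost(30):.2f}")
```

For example, 11TB out is charged as 10TB at $0.17 per GB plus 1TB at $0.13 per GB, not 11TB at the lower rate.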

For a limited time, at least until June 2009, the first 25 CPU hours and 1GB per month are free. This is designed to give people a chance to try out the service.

As a database guy, SimpleDB is a non-starter for me. It’s easy enough for me to install MySQL or Postgres (for free) or Oracle (if I want to pay for it) and scale those to almost ridiculous levels. SimpleDB does not provide the transactional consistency required for transaction processing (OLTP) nor does it provide the access paths or any of the key features (except maybe partitioning) required in OLAP processing.

These prices are accurate as of the time of writing them. As always, verify before making a decision.


A Quick Overview of a Database for the Cloud: CouchDB

December 21st, 2008 Comments off

Need a database in your cloud? Check out CouchDB.

What is CouchDB?

CouchDB is an Apache project. CouchDB is not a relational database. It seems that cloud computing has spawned, or at least made popular, a new breed of database. Rather than the hierarchical, network or relational databases of yore*, we have a new paradigm: key/value pairs. You declare a field and assign some values.

*I left object database and xml database off my database of yore list as they never really caught on.

SimpleDB is another key/value database that you may have heard of or used. SimpleDB is provided as part of Amazon Web Services (AWS).

What does CouchDB Offer?

CouchDB is accessible via JSON (which I like better than XML for tasks like these) and it uses JavaScript as a query language. CouchDB is document aware. That is, you create a new document and store related data within that document. There is no schema; documents are the important classification of your data.
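To make "document aware" a bit more concrete, here is a small sketch of schema-free documents and a map-style query over them. It is plain Python standing in for CouchDB: the documents and the map_tags function are invented for illustration, and in CouchDB itself the map function would be written in JavaScript.

```python
import json

# Documents share no schema; each one just carries whatever fields it needs.
docs = [
    {"_id": "post-1", "type": "post", "title": "CouchDB overview",
     "tags": ["database", "cloud"]},
    {"_id": "post-2", "type": "post", "title": "SimpleDB overview",
     "tags": ["database", "aws"]},
    {"_id": "note-1", "type": "note", "text": "no tags field here"},
]


def map_tags(doc):
    """Emit one (key, value) row per tag; documents without tags emit nothing."""
    for tag in doc.get("tags", []):
        yield tag, doc["_id"]


def build_view(documents, map_fn):
    """Run the map function over every document and sort the rows by key."""
    return sorted(row for doc in documents for row in map_fn(doc))


view = build_view(docs, map_tags)
print(json.dumps(docs[0], indent=2))
print(view)
```

The query logic lives in the map function, not in a schema, so a document that lacks a field simply contributes no rows.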

The really important thing is that CouchDB is highly distributed. It’s this feature that makes it desirable in situations where a relational database does not scale well. According to the Apache CouchDB Documentation:

CouchDB is a peer based distributed database system. Any number of CouchDB hosts (servers and offline-clients) can have independent “replica copies” of the same database, where applications have full database interactivity (query, add, edit, delete). When back online or on a schedule, database changes are replicated bi-directionally.

CouchDB has built-in conflict detection and management and the replication process is incremental and fast, copying only documents and individual fields changed since the previous replication. Most applications require no special planning to take advantage of distributed updates and replication.

Distributed from the ground up. Sweet.
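The incremental part is the interesting bit, so here is a toy sketch of it. This is not CouchDB's replication protocol: the Replica class and its sequence counter are invented for illustration, documents are copied whole, and conflicts are ignored (the newer copy simply wins), which a real system would not do.

```python
class Replica:
    """Toy replica that remembers when each document last changed locally."""

    def __init__(self):
        self.docs = {}        # doc_id -> document body
        self.seq = 0          # local update counter
        self.changed_at = {}  # doc_id -> seq of the last local change

    def save(self, doc_id, body):
        self.seq += 1
        self.docs[doc_id] = dict(body)
        self.changed_at[doc_id] = self.seq

    def changes_since(self, since):
        return [d for d, s in self.changed_at.items() if s > since]


def replicate(source, target, since=0):
    """Copy documents changed on source after the checkpoint; return new checkpoint."""
    copied = []
    for doc_id in source.changes_since(since):
        if target.docs.get(doc_id) != source.docs[doc_id]:
            target.save(doc_id, source.docs[doc_id])
            copied.append(doc_id)
    return source.seq, copied


laptop, server = Replica(), Replica()
laptop.save("a", {"n": 1})
laptop.save("b", {"n": 2})
checkpoint, first = replicate(laptop, server)

laptop.save("b", {"n": 3})  # only this document changes afterwards
checkpoint, second = replicate(laptop, server, since=checkpoint)
_, echoed = replicate(server, laptop)  # the other direction has nothing new

print(first, second, echoed)
```

The second pass only touches the one document that changed, and running it in the opposite direction copies nothing because both sides already agree.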

An important note about where CouchDB is different from SimpleDB is that CouchDB is ACID. Rather than using logs for consistency, CouchDB uses redundant sets of data (much like Vertica). CouchDB, like the other key/value databases, is “eventually consistent”. That means that it will take time for the replicas to be updated. CouchDB also uses MVCC and readers never block writers. Readers always see a consistent data set.
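A rough way to picture MVCC is a store where every write must name the revision it is based on. The sketch below is a simplification with invented names (DocStore, Conflict) and integer revisions; CouchDB tracks revisions with its own tokens in a `_rev` field.

```python
class Conflict(Exception):
    """Raised when a write is based on a revision that is no longer current."""


class DocStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (revision, body)

    def get(self, doc_id):
        # Readers get their own copy, so later writes never change what they hold.
        rev, body = self._docs[doc_id]
        return rev, dict(body)

    def put(self, doc_id, body, rev=None):
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise Conflict(f"{doc_id}: expected rev {current_rev}, got {rev}")
        new_rev = (current_rev or 0) + 1
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev


store = DocStore()
rev1 = store.put("doc", {"count": 1})
_, snapshot = store.get("doc")                  # a reader takes a snapshot
rev2 = store.put("doc", {"count": 2}, rev=rev1)  # a writer moves on, unblocked

try:
    store.put("doc", {"count": 99}, rev=rev1)   # stale revision
    stale_write_rejected = False
except Conflict:
    stale_write_rejected = True

print(snapshot, rev2, stale_write_rejected)
```

The reader's snapshot still shows the old value after the write, and the writer holding a stale revision is told to re-read instead of silently overwriting.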

CouchDB is written in Erlang. That’s a downside to me in that it is not a very common language. If you do need a patch in a hurry, it may be difficult to find someone qualified to write it. CouchDB was originally written in C++ but the author chose to redo it in Erlang for scalability reasons. Hmmm.

That’s the short story on CouchDB. I plan to write more about actually using CouchDB in the near future.

