Cassandra Notes 2

My understanding is that the Cassandra data model is sort of like a tree (4 levels at most):

                      Key Space
                          |
                    Column Family
                   /             \
[key, super-column, column, column…]   [key, column, column]

1, “Key Space”: it is the root. You can think of it as the “warehouse”, the highest level of storage. In client code it is just a “name”; it is defined only in storage-conf.xml.

2, Column: what a confusing name! It is the smallest data unit, the leaf node. It is a name-value pair with a timestamp, for example, {name:"age", value:"12", timestamp:1234567}

3, Super Column: a named list of columns. So the node under a key can be a leaf (a column) or have its own leaves (a super column).
It can be used to describe data horizontally, for example, {name:"joe smith", columns:{{name:"gender", value:"male", timestamp:123456}, {name:"age", value:"34", timestamp:123456}, {…}, {…}}}
It can be used to describe data vertically too, for example, {name:"inventory", columns:{{name:"tv", value:"12"}, {name:"dish-washer", value:"34"}, {…}, {…}}}

Another thing about “Column” and “SuperColumn”: although they are part of the “data model”, you DON’T DEFINE them. There is no XML or anything else that defines the “data model”. You still need to think the “data model” through (very important) and put the structure on a piece of paper, but that is all. You just create Columns and SuperColumns in your code. Your code simply tells the Cassandra server to “put this column/cell into that column-family/table”.

4, Another concept, which is not defined explicitly in Thrift or the storage XML but is nevertheless very important, is the “row”.
To understand the “row”, let’s first think about the “column family”. As the name implies, a “column family” is a family of columns/supercolumns. Because the column/supercolumn is the actual data unit, the column-family might be too “general” or “big”; chances are we need a way to divide the columns/super-columns into “sub-groups” under the column-family. A row is such a sub-group. It is virtual, and identified by a “key”.

5, “Column Family”: it is closest to the concept of “storage” or “table” if you HAVE TO map it to a relational database.
Unlike Column and SuperColumn, there is no “ColumnFamily” in the client library, i.e., you don’t create a ColumnFamily object. It is just a “name”. You do need to define the “column family” in storage-conf.xml, which is very simple.

That is pretty much everything as far as “data model” is concerned.
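
To make the tree concrete, here is a minimal sketch of the same hierarchy built from plain Java collections rather than the Thrift classes. Everything in it is my own illustration: the class names (DataModelSketch, Col, SuperColumnFamily, Keyspace), the "Warehouse"/"People"/"row-1" names, and the nested-map layout are assumptions, and this is not how Cassandra stores data internally. It only shows which name is needed at each level to reach a value.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a toy, in-memory picture of the hierarchy, not Cassandra's storage format.
public class DataModelSketch {

    // Column: the leaf, a name/value pair plus a timestamp.
    static final class Col {
        final String name;
        final String value;
        final long timestamp;

        Col(String name, String value, long timestamp) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Column family of super columns: row key -> super column name -> column name -> column.
    static final class SuperColumnFamily {
        final Map<String, Map<String, Map<String, Col>>> rows = new TreeMap<>();

        void insert(String rowKey, String superColumn, Col col) {
            rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
                .computeIfAbsent(superColumn, k -> new TreeMap<>())
                .put(col.name, col);
        }

        Col get(String rowKey, String superColumn, String column) {
            Map<String, Map<String, Col>> row = rows.get(rowKey);
            if (row == null) return null;
            Map<String, Col> columns = row.get(superColumn);
            return columns == null ? null : columns.get(column);
        }
    }

    // Keyspace: the root, a name plus its column families. The families are fixed up front,
    // mirroring the fact that they are declared in storage-conf.xml and not in client code.
    static final class Keyspace {
        final String name;
        final Map<String, SuperColumnFamily> columnFamilies = new TreeMap<>();

        Keyspace(String name, String... columnFamilyNames) {
            this.name = name;
            for (String cf : columnFamilyNames) {
                columnFamilies.put(cf, new SuperColumnFamily());
            }
        }
    }

    public static void main(String[] args) {
        Keyspace keyspace = new Keyspace("Warehouse", "People"); // hypothetical names
        SuperColumnFamily people = keyspace.columnFamilies.get("People");
        people.insert("row-1", "joe smith", new Col("gender", "male", 123456L));
        people.insert("row-1", "joe smith", new Col("age", "34", 123456L));
        System.out.println(people.get("row-1", "joe smith", "age").value); // prints 34
    }
}
```

Note that rows and super columns appear the moment something is inserted under them, while the column family has to exist beforehand.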

Use warehouse shipping as an example. Every shipped package has a tracking number, and every package may contain multiple ordered items. One way to define the data model is a “ShippedPackages” column family of super-columns, with one row per package.

The tree will be like

["UPS93iksnao930dkdk3331", "toaster202", "hair-drier832"]    <-- this is the ROW; the key is the package tracking number, and each row has multiple super-columns, each holding the shipped item and the item quantity.
         [{name="item-name", value="toaster202"}, {name="item-quantity", value="2"}]

But we are missing some information, for example the “order number”. The warehouse needs to store the order number along with the shipping information so that we can later display package tracking to the customer. I chose to add a new column-family to store the relationship between order and shipping. Each row is keyed by the order number, and each super-column, named after a tracking number, holds:

[{name="ship-date", value="2/2/2010"}, {name="carrier", value="UPS"}]
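
The reason for the second column family is the two-step lookup it enables: order number to tracking numbers, then tracking number to items. Here is a rough sketch of that lookup using nested maps in place of a Cassandra server. The class and method names (ShippingLookupSketch, put, itemsForOrder) are made up for illustration, timestamps are left out, and the sample values are borrowed from the test case in this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: each super column family is modelled as
// row key -> super column name -> column name -> value (timestamps omitted).
public class ShippingLookupSketch {

    // Keyed by package tracking number; one super column per shipped item.
    static final Map<String, Map<String, Map<String, String>>> shippedPackages = new TreeMap<>();
    // Keyed by order number; one super column per tracking number.
    static final Map<String, Map<String, Map<String, String>>> orderShippedPackages = new TreeMap<>();

    static void put(Map<String, Map<String, Map<String, String>>> columnFamily,
                    String rowKey, String superColumn, String column, String value) {
        columnFamily.computeIfAbsent(rowKey, k -> new TreeMap<>())
                    .computeIfAbsent(superColumn, k -> new TreeMap<>())
                    .put(column, value);
    }

    // Follow order number -> tracking numbers -> item super column names.
    static List<String> itemsForOrder(String orderNumber) {
        List<String> items = new ArrayList<>();
        Map<String, Map<String, String>> packages = orderShippedPackages.get(orderNumber);
        if (packages == null) {
            return items;
        }
        for (String trackingNumber : packages.keySet()) {
            Map<String, Map<String, String>> packageRow = shippedPackages.get(trackingNumber);
            if (packageRow != null) {
                items.addAll(packageRow.keySet());
            }
        }
        return items;
    }

    public static void main(String[] args) {
        put(shippedPackages, "UPS68", "toaster2", "item-name", "toaster2");
        put(shippedPackages, "UPS68", "toaster2", "item-quantity", "1");
        put(shippedPackages, "UPS68", "hair-drier-2", "item-name", "hair-drier-2");
        put(shippedPackages, "UPS68", "hair-drier-2", "item-quantity", "2");
        put(orderShippedPackages, "order484", "UPS68", "ship-date", "2/2/2010");
        put(orderShippedPackages, "order484", "UPS68", "carrier", "UPS");
        System.out.println(itemsForOrder("order484")); // prints [hair-drier-2, toaster2]
    }
}
```

Without the second column family there would be no way to get from an order to its packages short of scanning every row of “ShippedPackages”.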

Now let’s write a Java client to test:

import java.io.UnsupportedEncodingException;
import java.util.*;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.*;
import org.apache.thrift.transport.*;
// Thrift-generated classes. This package name assumes Cassandra 0.6; earlier releases used org.apache.cassandra.service.
import org.apache.cassandra.thrift.*;

	public void testAddOneShippedPackage() throws UnsupportedEncodingException, InvalidRequestException, UnavailableException, TimedOutException, TException{
		TTransport transport = new TSocket("localhost", 9160);
		TProtocol protocol = new TBinaryProtocol(transport);
		Cassandra.Client client = new Cassandra.Client(protocol);
		transport.open();
		try{
			String orderNumber = "order484";
			String[] packageTracking = {"UPS68"};
			//itemsInPackage[i] and quantityOfItems[i] describe packageTracking[i]; only one package is shipped here
			String[][] itemsInPackage = {{"toaster2","hair-drier-2"},{"coffee-maker-2"}};
			String[][] quantityOfItems = {{"1","2"},{"3"}};
			Map<String, List<ColumnOrSuperColumn>> orderShippedPackages = new HashMap<String, List<ColumnOrSuperColumn>>();
			List<ColumnOrSuperColumn> orderPackageRow = new ArrayList<ColumnOrSuperColumn>();

			long timestamp = System.currentTimeMillis();
			for(int i=0;i<packageTracking.length;i++){
				//one row per package, with one super-column per shipped item
				List<ColumnOrSuperColumn> shippedPackageRow = new ArrayList<ColumnOrSuperColumn>();
				for(int j=0;j<itemsInPackage[i].length;j++){
					System.out.println("processing " + packageTracking[i] + " " + itemsInPackage[i][j] + " " + quantityOfItems[i][j]);
					Column itemNumber = new Column("item-number".getBytes("utf-8"), itemsInPackage[i][j].getBytes("utf-8"),timestamp);
					Column quantityInPackage = new Column("quantityInPackage".getBytes("utf-8"), quantityOfItems[i][j].getBytes("utf-8"),timestamp);
					//name the super-column after the item, as in the tree above
					SuperColumn itemInPackage = new SuperColumn();
					itemInPackage.setName(itemsInPackage[i][j].getBytes("utf-8"));
					itemInPackage.setColumns(Arrays.asList(itemNumber, quantityInPackage));
					ColumnOrSuperColumn c = new ColumnOrSuperColumn();
					c.setSuper_column(itemInPackage);
					shippedPackageRow.add(c);
				}
				//add the package to the order row: the super-column is named after the tracking number
				Column shipDateColumn = new Column("shipped-date".getBytes("utf-8"),"2/2/2010".getBytes("utf-8"),timestamp);
				Column carrierColumn = new Column("carrier".getBytes("utf-8"),"UPS".getBytes("utf-8"),timestamp);
				SuperColumn orderPackageShipped = new SuperColumn();
				orderPackageShipped.setName(packageTracking[i].getBytes("utf-8"));
				orderPackageShipped.setColumns(Arrays.asList(shipDateColumn, carrierColumn));
				ColumnOrSuperColumn orderEntry = new ColumnOrSuperColumn();
				orderEntry.setSuper_column(orderPackageShipped);
				orderPackageRow.add(orderEntry);
				//add this row to column family ShippedPackages, keyed by package tracking number
				Map<String, List<ColumnOrSuperColumn>> oneRowPerColumnFamily = new HashMap<String, List<ColumnOrSuperColumn>>();
				oneRowPerColumnFamily.put("ShippedPackages", shippedPackageRow);
				client.batch_insert("SuperInventory", packageTracking[i], oneRowPerColumnFamily, ConsistencyLevel.ALL);
			}
			//now add the order-shipped-package relationship, keyed by order number
			orderShippedPackages.put("OrderShippedPackages", orderPackageRow);
			client.batch_insert("SuperInventory", orderNumber, orderShippedPackages, ConsistencyLevel.ALL);
		}finally{
			transport.close();
		}
	}


In the above test case, once we have sent out a package, we insert a row into the “ShippedPackages” column family, keyed by the package tracking number; we also add one row to the “OrderShippedPackages” column family, keyed by the order number. The name of the super-column in that row is the tracking number.

Note: running this test case repeatedly with the same order number but a different tracking number updates the existing row in “OrderShippedPackages” with the new super-column (this is what we want). Cassandra does not replace the row: an insert under a key that already exists in “OrderShippedPackages” just adds the new column/super-column to that row.
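
One way to picture this is as a merge per column name, with the newer timestamp winning when the same name is written twice. The sketch below is a toy model under that assumption, not Cassandra code: RowMergeSketch and its methods are hypothetical, the super-column level is flattened into plain columns to keep it short, "UPS69" is a made-up second tracking number, and keeping the existing value on an exact timestamp tie is a simplification of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: inserts under an existing row key are merged column by column.
public class RowMergeSketch {

    static final class Col {
        final String value;
        final long timestamp;

        Col(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // row key -> column name -> column
    private final Map<String, Map<String, Col>> rows = new TreeMap<>();

    void insert(String rowKey, String columnName, String value, long timestamp) {
        Map<String, Col> row = rows.computeIfAbsent(rowKey, k -> new TreeMap<>());
        Col existing = row.get(columnName);
        // A new name is simply added; for an existing name the newer timestamp wins.
        if (existing == null || timestamp > existing.timestamp) {
            row.put(columnName, new Col(value, timestamp));
        }
    }

    List<String> columnNames(String rowKey) {
        Map<String, Col> row = rows.get(rowKey);
        return row == null ? new ArrayList<String>() : new ArrayList<String>(row.keySet());
    }

    String value(String rowKey, String columnName) {
        Map<String, Col> row = rows.get(rowKey);
        Col col = row == null ? null : row.get(columnName);
        return col == null ? null : col.value;
    }

    public static void main(String[] args) {
        RowMergeSketch orderShippedPackages = new RowMergeSketch();
        orderShippedPackages.insert("order484", "UPS68", "2/2/2010", 1L); // first run
        orderShippedPackages.insert("order484", "UPS69", "2/3/2010", 2L); // second run, new tracking number
        System.out.println(orderShippedPackages.columnNames("order484")); // prints [UPS68, UPS69]
    }
}
```

The first entry survives the second insert because the two writes use different names; only a write to the same name with a newer timestamp would replace it.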

