Learning Apache Cassandra(Second Edition)
上QQ阅读APP看书,第一时间看更新

Multiple clustering columns

Clustering columns are not limited to one field as specified before. Let's take a look at how multiple clustering columns work and facilitate data ordering. To illustrate this, we will recreate our status updates table so that it is clustered by the date and time when the user updated their status:

CREATE TABLE "user_status_updates_by_datetime" ( 
"username" text,
"status_date" date,
"status_time" time,
"body" text,
PRIMARY KEY ("username", "status_date", "status_time")
);

We have created a new table user_status_updates_by_datetime as shown next:

  • Partition key: username, which is a text field.
  • Clustering columns: status_date and status_time. Rows for a particular username are clustered by both columns, first by status_date and then by status_time.

Since our primary goal is to figure out how a primary key with multiple clustering columns behaves, we will be inserting updates for a single user, alice.

Let's insert a few records as follows:

INSERT INTO "user_status_updates_by_datetime" ("username",  
"status_date", "status_time", "body")
VALUES ('alice', '2016-11-18', '08:30:55.123', 'Alice Update 1');

INSERT INTO "user_status_updates_by_datetime" ("username",
"status_date", "status_time", "body")
VALUES ('alice', '2016-11-18', '14:40:25.123456789', 'Alice Update 2');

INSERT INTO "user_status_updates_by_datetime" ("username",
"status_date", "status_time", "body")
VALUES ('alice', '2016-11-19', '08:25:25', 'Alice Update 3');

INSERT INTO "user_status_updates_by_datetime" ("username",
"status_date", "status_time", "body")
VALUES ('alice', '2016-11-21', '08:35:55.123456', 'Alice Update 4');

INSERT INTO "user_status_updates_by_datetime" ("username",
"status_date", "status_time", "body")
VALUES ('alice', '2016-11-21', '14:30:15.123', 'Alice Update 5');

INSERT INTO "user_status_updates_by_datetime" ("username",
"status_date", "status_time", "body")
VALUES ('alice', '2016-11-23', '14:50:45.123456', 'Alice Update 6');

You will notice that status_date and status_time, which are date and time data types respectively, are inserted as strings similar to text data type.

As mentioned earlier, date should be of the format YYYY-MM-DD, and time should be of the format HH:MM:SS.

The CQL protocol enforces strict rules on values you can enter for any field, be it year, month, day, hour, minute, or second. Time is specified with nanosecond precision. It isn't necessary to specify all nine digits of the nanoseconds as we saw in the preceding INSERT statements.

Let's try to insert a few records with erroneous input:

INSERT INTO "user_status_updates_by_datetime" ("username",  
"status_date", "status_time", "body")
VALUES ('alice', '2016-14-23', '14:50:45.123456', 'Alice Update 7');

The response is as follows:

INSERT INTO "user_status_updates_by_datetime" ("username",  
"status_date", "status_time", "body")
VALUES ('alice', '2016-11-23', '14:65:45.123456', 'Alice Update 8');

The response is as follows:

We notice that both of the preceding queries failed. The first insert failed because of an invalid value for status_date, which has a wrong value for month. The second insert failed because of an invalid value for status_time, which has a wrong value for minutes. Underneath, CQL protocol converts both status_date and status_time to a long format and checks for correctness of different values within the data types, date, and time.

Now, if we want to fetch all the status updates by username alice regardless of status_date or status_time, we simply run the following query:

SELECT * FROM "user_status_updates_by_datetime";

The response is as follows:

You will notice that all the status updates for the user alice are ordered by status_date first and then by status_time for a particular status_date value. The order in which the inserts were performed doesn't matter. The results would still have been ordered by date and time.

Let's try to run some queries on clustering columns and see which queries are supported and which are not. First, we will fetch all status updates before 2016-11-20 regardless of the time:

SELECT * FROM "user_status_updates_by_datetime" 
WHERE "username" = 'alice' AND "status_date" < '2016-11-20';

The response is as follows:

You will see that the query returns three status updates, which were posted before 2016-11-20. Always make sure your query contains an equality relation on the partition key. Clustering columns support both equality and inequality relations although there are certain limitations.

Let's try to get status updates that were posted after 2016-11-20 and after noon or 12:00:00. This can be done by running the following query:

SELECT * FROM "user_status_updates_by_datetime" 
WHERE "username" = 'alice' AND "status_date" > '2016-11-20' AND
"status_time" > '12:00:00';

The response is as follows:

This throws a cryptic error saying it is an invalid query. This is one of the limitations when querying is based on multiple clustering columns. This simply means that if you are using an inequality relation on a clustering column, all the preceding clustering columns need to have an equality relation. Let's test this by fetching status updates that were posted on 2016-11-21 and after 12:00:00:

SELECT * FROM "user_status_updates_by_datetime" 
WHERE "username" = 'alice' AND "status_date" = '2016-11-21' AND
"status_time" > '12:00:00';