Posts Tagged UTF-8

Retrieving UTF-8 values from Cassandra

I recently tried out Apache Cassandra, a NoSQL database initially developed at Facebook and designed to handle very large amounts of data spread across many commodity servers while providing a highly available service with no single point of failure.

In order to populate the database, I used Apache Flume and the Flume NG Apache Cassandra Sink, which let me inject logs into it. But let’s focus on Cassandra here; I will write about Flume in later posts.

This is the Cassandra schema I was using (which is the one suggested by the Cassandra sink):

create keyspace logs with
   strategy_options = {datacenter1:1}
;

use logs;

create column family records with
   comparator = UTF8Type
   and gc_grace = 86400
;
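
Before loading any data, a quick optional sanity check is to describe the keyspace from cassandra-cli and confirm the column family options (on older CLI versions the command is describe keyspace logs; instead of describe;):

use logs;
describe;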

After adding the data to the database, I wanted to fetch it to make sure everything had gone well.
I tried three different ways:

  1. Cassandra CLI
    To return the first 100 rows (and all associated columns) from the records column family, I ran the following command:

    LIST records;

    However, the rows looked like this:

    RowKey: 6c6f67696b6465763a32653630306661362d633664652d346336612d386561352d323636326533353661616332
    => (column=data, value=39312e36362e3233392e323530202d202d205b32392f4465632f323031323a30393a31383a3433202d303730305d2022
    474554202f77702d636f6e74656e742f706c7567696e732f73796e746178686967686c6967687465722f73796e746178686967686c6967687465723
    22f736372697074732f636c6970626f6172642e73776620485454502f312e31222032303020313635392022687474703a2f2f7777772e6c6f67696b
    6465762e636f6d2f323031302f30372f30372f7573652d736572766572786d6c687474702d7468726f7567682d70726f78792f2220224d6f7a696c6
    c612f352e30202857696e646f7773204e5420362e313b20574f5736343b2072763a31372e3029204765636b6f2f3230313030313031204669726566
    6f782f31372e3022, timestamp=1357169131235135)
    => (column=host, value=39312e36362e3233392e323530, timestamp=1357169131235134)
    => (column=src, value=6c6f67696b646576, timestamp=1357169131235133)
    => (column=ts, value=323031322d31322d32395431363a31383a34332e3030305a, timestamp=1357169131235132) 
    

    This behavior is clearly explained on the DataStax page Getting Started Using the Cassandra CLI:

    Cassandra stores all data internally as hex byte arrays by default. If you do not specify a default row key validation class, column comparator and column validation class when you define the column family, Cassandra CLI will expect input data for row keys, column names, and column values to be in hex format (and data will be returned in hex format).
    To pass and return data in human-readable format, you can pass a value through an encoding function. Available encodings are:
    * ascii
    * bytes
    * integer (a generic variable-length integer type)
    * lexicalUUID
    * long
    * utf8
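
    For instance, wrapping a row key in one of these encoding functions lets you query with a human-readable key instead of its hex form (the key below is the one from the earlier output):

    GET records[utf8('logikdev:2e600fa6-c6de-4c6a-8ea5-2662e356aac2')];

    Note that this only helps on the input side; to get readable output, we still need to tell the CLI how to decode the data, as shown below.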

    This means that we need to specify the encoding in which column family data should be returned. We can set it for the entire client session using the following commands:

    ASSUME records KEYS AS utf8;
    ASSUME records COMPARATOR AS utf8;
    ASSUME records VALIDATOR AS utf8;
    

    So if we now run the LIST records command again, the rows look like this:

    RowKey: logikdev:2e600fa6-c6de-4c6a-8ea5-2662e356aac2
    => (column=data, value=91.66.239.250 - - [29/Dec/2012:09:18:43 -0700] "GET /wp-content/plugins/syntaxhighlighter/syntax
    highlighter2/scripts/clipboard.swf HTTP/1.1" 200 1659 "http://www.logikdev.com/2010/07/07/use-serverxmlhttp-through-pro
    xy/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0", timestamp=1357169131235135)
    => (column=host, value=91.66.239.250, timestamp=1357169131235134)
    => (column=src, value=logikdev, timestamp=1357169131235133)
    => (column=ts, value=2012-12-29T16:18:43.000Z, timestamp=1357169131235132)
    

    Much better! 😉

  2. CQL
    In order to retrieve columns in the records column family, we can use the following SELECT command:

    SELECT * FROM records LIMIT 1;

    However, the output looks like this:

     key                                                                                        | column1 | value
    --------------------------------------------------------------------------------------------+---------+---------------
    ----------------------------------------------------------------------------------------------------------------------
    ----------------------------------------------------------------------------------------------------------------------
    ----------------------------------------------------------------------------------------------------------------------
    ----------------------------------------------------------------------------------------------------------------------
    ----------------------------------------------------------------------------------------------------------------------
    ----------------------------------------------------------------------------------------------------------------------
    ----------------------------------------------------------------------------------------------------------------------
    -------------
     6c6f67696b6465763a31656438616166362d386438322d343330372d386133612d336438363664633133323636 |    data | 3131352e313131
    2e33322e313730202d202d205b32392f4465632f323031323a30353a34303a3338202d303730305d2022474554202f323031302f30382f32362f76
    616c69646174652d66696c652d73697a652d7072696f722d75706c6f61642f20485454502f312e31222032303020393931302022687474703a2f2f
    7777772e676f6f676c652e636f2e696e2f75726c3f73613d74267263743d6a26713d26657372633d7326736f757263653d7765622663643d392676
    65643d3043484951466a41492675726c3d687474702533412532462532467777772e6c6f67696b6465762e636f6d25324632303130253246303825
    3246323625324676616c69646174652d66696c652d73697a652d7072696f722d75706c6f61642532462665693d764f5465554c7157443457717241
    665f6d4948774177267573673d4146516a434e4839385070715373655375304e4a6a46484f534d5576515642704e41266361643d726a612220224d
    6f7a696c6c612f352e30202857696e646f7773204e5420362e313b2072763a31372e3029204765636b6f2f32303130303130312046697265666f78
    2f31372e3022
    

    As you can see, we have a similar problem to the one in the Cassandra CLI. And, funny enough, the fix is also similar; only the syntax is a bit different.
    Indeed, to treat the column values in the records table as being of type UTF-8, we need to run the following commands:

    ASSUME records(key) VALUES ARE text;
    ASSUME records(value) VALUES ARE text;
    

    First of all, note that we overrode the type as ‘text’ instead of UTF-8, as explained on this page.
    Secondly, the cqlsh help page says we can simply run the command ASSUME records VALUES ARE text;, which is supposed to treat all column values in the table records as being of type text. But, for some reason, this did not work for me, and I had to run it for each column value.

  3. twitter/cassandra
    Finally, I tried this Ruby client (which can be found here) directly from the Interactive Ruby Shell (IRB).
    The first thing we have to do is connect to Cassandra using the following commands:

    require 'cassandra'
    client = Cassandra.new('logs', '127.0.0.1:9160')
    

    We can then run the command below to get one of the rows:

    client.get(:records, 'logikdev:69a2ef29-599f-48f0-99ef-d4c2ad1f4169')
    

    This would return something like the following:

    => #"85.168.229.99 - - [29/Dec/2012:09:48:22 -0700] \"GET /wp-content/plugins/syntaxhighlighter/s
    yntaxhighlighter2/styles/shThemeDefault.css?ver=2.1.364 HTTP/1.1\" 200 3936 \"http://www.logikdev.com/2010/12/02/update
    -your-twitter-status-with-java/\" \"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0\"", "host"
    =>"85.168.229.99", "src"=>"logikdev", "ts"=>"2012-12-29T16:48:22.000Z"}
    {"data"=>1357169131284005, "host"=>1357169131284004, "src"=>1357169131284003, "ts"=>1357169131284002}>
    

    And, miraculously, there is no need to specify any type this time!
    As you can see, the data comes back already decoded. 🙂
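
    As a further sanity check (a sketch using the same client, not something from the original test), you can also fetch a single column instead of the whole row; it too comes back as a plain, already decoded Ruby string:

    require 'cassandra'
    client = Cassandra.new('logs', '127.0.0.1:9160')

    # Fetch only the 'host' column of the row retrieved above;
    # the client returns the decoded value directly.
    host = client.get(:records, 'logikdev:69a2ef29-599f-48f0-99ef-d4c2ad1f4169', 'host')
    puts host   # => "85.168.229.99"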

EDIT:
After some more research, I found a better solution to our problem: you can define the validation classes directly in the create statement of the column family. For our example, the Cassandra schema would then become:

create keyspace logs with
   strategy_options = {datacenter1:1}
;

use logs;

create column family records with
   comparator = UTF8Type
   and gc_grace = 86400
   and key_validation_class = UTF8Type
   and default_validation_class = UTF8Type
;

With this solution, there is no need to use the ASSUME commands; the data will be displayed properly right away.


Locale settings for your cron job

Do you get a problem with special characters when your bash script is executed from a cron job?
And does the same script work fine when it is run directly from the command line?
If so, keep reading! 😉

The cause of this character problem is most likely your locale settings.
Indeed, if you run the locale command both from the command line and from a cron job, you may get different results, such as:

From the command line:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

From a cron job:

LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
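
If you want to reproduce this comparison yourself, a simple way (the output path below is just an example) is to add a temporary crontab entry that dumps cron's view of the locale to a file:

* * * * * locale > /tmp/cron-locale.txt 2>&1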

As you can see, the cron job is not using UTF-8. That must be the problem! 🙂

So the question now is: how do we change the locale settings for the cron job?
Some people say that you need to add the following environment variables to the crontab entry:

SHELL=/bin/bash
LANG=en_US.UTF-8
LANGUAGE=en
LC_CTYPE=en_US.UTF-8

But this actually didn’t work for me. 🙁

What you can do instead is create the file /etc/environment (if it is not already present) and add the following line:

LANG=en_US.UTF-8

The cron process will read this file when it starts, so you need to restart it in order to apply the change:

service cron restart
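
If restarting cron is not an option, a common alternative (a workaround I did not test in this setup) is to export the locale at the top of the script itself:

#!/bin/bash
export LANG=en_US.UTF-8
# ... rest of the script ...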

Hope this fixes your character problem. 😉
