HDB++ for Cassandra

Hi Reynald,

Thanks for your input. We have successfully configured HDB++ for Cassandra after changing the C++ Cassandra driver version.

But I had to change something in the keyspace to make it work.

Updated:
CREATE KEYSPACE IF NOT EXISTS hdb WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;

Original:
CREATE KEYSPACE IF NOT EXISTS hdb WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3 };

Thanks,
Sandeep
Hi Sandeep,

Good news! :)
If you are experimenting with, and planning to use, only one datacenter, I think it indeed makes sense to use SimpleStrategy. If you plan to add a new datacenter at some point, you should use NetworkTopologyStrategy, as described in the Cassandra documentation (I guess you already saw that).
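For reference, switching later would not require recreating the keyspace. A minimal sketch, assuming a second datacenter named DC2 is added and the names match your snitch configuration:

ALTER KEYSPACE hdb WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3, 'DC2' : 3 };

After such a change, a repair/rebuild is still needed so that the existing data gets replicated to the new datacenter.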

The default value of durable_writes is true, so this part should not be necessary.

Please remember that HDB++ Cassandra is still under development (as you noticed, the installation procedure should be simplified, and some documentation still has to be written and published), so some upcoming changes may impact you.
I wouldn't recommend using it in production right now.
An important change still needs to be implemented (partitioning per hour instead of per day). This will improve both the performance and the robustness of the system.
If you use the current version to store production data, you will need to convert the already stored data to the new per-hour partitioning scheme once it is implemented.

Cheers,
Reynald
Rosenberg's Law: Software is easy to make, except when you want it to do something new.
Corollary: The only software that's worth making is software that does something new.
Sorry for replying to a very old post, but I found this thread and was curious about it, since we are considering deploying HDB++ with Cassandra at MAXIV.

You (Reynald) mentioned that you were planning to change the partitioning period to per hour instead of per day. Is this change implemented in the latest version of HDB++? If not, is it still on the roadmap?

I have been trying the latest version, but unfortunately I have not found a way to use it with Cassandra 3.0, since HDB++ is not quite compatible with the newer versions of the C++ driver that Cassandra 3.0 requires. This is actually a regression: the old version we have been testing with so far works fine with Cassandra 3.0, through C++ driver version 2.2.2. I guess you are not using Cassandra 3 and have not had a reason to try newer library versions?

Cheers,
Johan
johfor
You (Reynald) mentioned that you were planning to change the partitioning period to per hour instead of per day. Is this change implemented in the latest version of HDB++? If not, is it still on the roadmap?

Hi,
This change is not implemented yet in the latest version of HDB++.
It is still on the roadmap, because one-day-per-attribute partitions can cause some trouble with Cassandra (partitions getting too big), especially if you are receiving several events per second for some attributes.
The best would be to find a way to configure this per attribute.
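To illustrate the idea with a simplified sketch (this is not the actual HDB++ schema; the table and column names here are hypothetical):

CREATE TABLE hdb.att_history (
    att_id int,
    period text,            -- currently one partition per attribute per day, e.g. '2016-06-01'
    event_time timestamp,
    value double,
    PRIMARY KEY ((att_id, period), event_time)
);

With per-hour partitioning, the period would become e.g. '2016-06-01-00', so an attribute archived at several events per second would end up with tens of thousands of rows per partition instead of hundreds of thousands.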

johfor
I have been trying the latest version, but unfortunately I have not found a way to use it with Cassandra 3.0, since HDB++ is not quite compatible with the newer versions of the C++ driver that Cassandra 3.0 requires. This is actually a regression: the old version we have been testing with so far works fine with Cassandra 3.0, through C++ driver version 2.2.2. I guess you are not using Cassandra 3 and have not had a reason to try newer library versions?

Hi, I'm surprised the old HDB++ version is compatible with Cassandra 3.0 but the newer one is not.
In any case, as you guessed, we have not tried Cassandra 3.0 yet because we thought it was still a bit early to move to a new major release of Cassandra.
Our experience showed that it is better to wait some time for the bugs to be fixed in new Cassandra versions, especially with Cassandra 3.0, which introduces a new storage engine.
We are using Cassandra 2.2(.4), and HDB++ should work with Cassandra 2.2.x versions.

Cheers,
Reynald
Rosenberg's Law: Software is easy to make, except when you want it to do something new.
Corollary: The only software that's worth making is software that does something new.
Reynald
Hi,
This change is not implemented yet in the latest version of HDB++.
It is still on the roadmap, because one-day-per-attribute partitions can cause some trouble with Cassandra (partitions getting too big), especially if you are receiving several events per second for some attributes.
The best would be to find a way to configure this per attribute.

OK, that is good to know. Any guess as to the possible time frame for this?

Is there a significant cost to making the period shorter? I suppose querying over lots of periods might be a bit slower…?

Reynald
Hi, I'm surprised the old HDB++ version is compatible with Cassandra 3.0 but the newer one is not.
In any case, as you guessed, we have not tried Cassandra 3.0 yet because we thought it was still a bit early to move to a new major release of Cassandra.
Our experience showed that it is better to wait some time for the bugs to be fixed in new Cassandra versions, especially with Cassandra 3.0, which introduces a new storage engine.
We are using Cassandra 2.2(.4), and HDB++ should work with Cassandra 2.2.x versions.

I see, and this makes sense. However, as far as I can see, DataStax are now shipping Cassandra 3.0 as part of their "enterprise" product, so I assume it's considered stable for production now. Perhaps we can have a look at fixing the driver incompatibility, since we may be the only ones who care right now :)

Thanks for your reply,
Johan
johfor
Reynald
Hi,
This change is not implemented yet in the latest version of HDB++.
It is still on the roadmap, because one-day-per-attribute partitions can cause some trouble with Cassandra (partitions getting too big), especially if you are receiving several events per second for some attributes.
The best would be to find a way to configure this per attribute.

OK, that is good to know. Any guess as to the possible time frame for this?

Difficult to say right now… I'm quite busy with the Tango kernel (and other things!). We are hiring someone who will work on HDB++ soon, but I suspect this change won't be released for several months.

johfor
Is there a significant cost to making the period shorter? I suppose querying over lots of periods might be a bit slower…?

Yeah, the main drawback is that we will have to execute many more queries, even if there is not much data in the partitions we are querying… In terms of performance, it might be a bit slower, but maybe not that much actually… This has to be evaluated.
If we have attributes which send archive events only once every hour, we will have partitions containing only 1 row, which is clearly sub-optimal (even a partition per day is a bit overkill in this case).
The best would be if Cassandra were able to adapt the partition size itself, but this is not the case.
So this is not so easy to handle in our case, because we are event-based: there can be periods where many events are sent within the same second, while in other periods only a few events per day are sent.
I think there is room for improvement in Cassandra itself, because it complains in the logs when partitions get too big but does nothing else about it.
I guess there must be a way to improve HDB++ so that it adapts automatically and creates partitions of a reasonable size. Using Spark to reorganize the data into optimally sized partitions, and adding information about which partitions are available for a given attribute in a given period, would be one idea, for instance.
But there must be much cleverer ideas… :)
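To make the "many more queries" point concrete with the hypothetical schema sketched earlier: fetching one day of data under per-hour partitioning means touching 24 partitions instead of 1, typically with one query per partition issued asynchronously by the client:

SELECT event_time, value FROM hdb.att_history WHERE att_id = 42 AND period = '2016-06-01-00';
SELECT event_time, value FROM hdb.att_history WHERE att_id = 42 AND period = '2016-06-01-01';
-- ... and so on, up to '2016-06-01-23'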

johfor
I see, and this makes sense. However, as far as I can see, DataStax are now shipping Cassandra 3.0 as part of their "enterprise" product, so I assume it's considered stable for production now. Perhaps we can have a look at fixing the driver incompatibility, since we may be the only ones who care right now :)

Thanks for your reply,
Johan

Any contribution is welcome! :)
Rosenberg's Law: Software is easy to make, except when you want it to do something new.
Corollary: The only software that's worth making is software that does something new.
Agreed, it would be nice to be able to tune the partition size per attribute.

I suppose that if you were to somehow change the size of partitions dynamically, the way partition keys work would also need to change, since right now it is assumed that you can know beforehand the key of the partition your data is in.

The Cassandra docs mention 100,000 points / 100 MB as a "rule of thumb" maximum partition size, but it seems like these numbers are really limitations from pre-2.x days. I also see that 3.0 has made a lot of changes to how rows and partitions are stored on disk; it might be interesting to compare.
johfor
Agreed, it would be nice to be able to tune the partition size per attribute.

I suppose that if you were to somehow change the size of partitions dynamically, the way partition keys work would also need to change, since right now it is assumed that you can know beforehand the key of the partition your data is in.

Exactly. If nothing is done on the Cassandra side, we would need to find a way to tell the clients which partitions are available for the attribute and period of time they want to retrieve.
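One way of doing that (a rough sketch of the idea only; the table and column names are hypothetical) would be a small lookup table maintained by the archiver at insert time:

CREATE TABLE hdb.att_available_partitions (
    att_id int,
    period text,
    PRIMARY KEY (att_id, period)
);

The archiver would insert (att_id, period) when writing the first event of a new partition, and clients would query this table first to learn which partitions exist for the requested attribute and time range.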

johfor
The Cassandra docs mention 100,000 points / 100 MB as a "rule of thumb" maximum partition size, but it seems like these numbers are really limitations from pre-2.x days. I also see that 3.0 has made a lot of changes to how rows and partitions are stored on disk; it might be interesting to compare.

Might be interesting indeed.
FYI, another improvement planned in HDB++/Cassandra, which will reduce the partition size, is to move the timestamps used for diagnostics (recv_time, recv_time_us, insert_time and insert_time_us) to another, optional table.
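Something along these lines, as a hypothetical sketch mirroring the main table's primary key so the diagnostic rows can be matched back to the data rows:

CREATE TABLE hdb.att_diagnostics (
    att_id int,
    period text,
    event_time timestamp,
    recv_time timestamp,
    recv_time_us int,
    insert_time timestamp,
    insert_time_us int,
    PRIMARY KEY ((att_id, period), event_time)
);

Keeping these four diagnostic timestamps out of the main data table shrinks every data row there, and therefore the partitions.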
Rosenberg's Law: Software is easy to make, except when you want it to do something new.
Corollary: The only software that's worth making is software that does something new.
Hi Reynald,

I need to clarify my understanding of how the HDB++ Event Subscribers connect to Cassandra.

In the Event Subscriber class properties, we specify DbHost, DbName and DbPort. This implies that all the Event Subscribers deployed in a single Tango Facility will use these class properties and connect to the specified Cassandra node of a cluster.

Suppose a Cassandra cluster comprises two nodes, Cassandra Node A and Cassandra Node B, deployed on different machines. I want to distribute the HDB++ archiving operations between the Cassandra nodes, i.e. some Event Subscribers will write to Cassandra Node A while other Event Subscribers will write to Cassandra Node B. For this, I defined DbHost, DbName and DbPort as Device properties, which override the Class properties.
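For example (the hostnames are hypothetical; 9042 is Cassandra's default CQL native protocol port), the subscribers meant to write through Node A would get Device properties along these lines:

DbHost: node-a.example.org
DbName: hdb
DbPort: 9042

and the subscribers for Node B would set DbHost to node-b.example.org instead.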

Please correct me if I'm wrong.

Kind regards,
Jyotin
Hi Jyotin,
you're right, the Device properties, when defined, will override the Class properties.
Cheers,
Lorenzo
 