Category Archives: Knowledge

Oracle Database INTERVAL Partitioning and Virtual Columns

Get Social!

oracle-logoI bumped into this issue recently which prevents you INSERTING a date of 31-12-9999 into an INTERVAL partitioned table due to the fact that the LESS THAN partition value would be larger than an Oracle Date.

After speaking with Jeff Moss he mentioned that we might be able to get round the problem by using a virtual column on the table and partition on that. I didn’t find a satisfactory solution using this method, but understanding some of the optimisers quirks and abilities when using partitions and virtual columns was rather interesting.

Starting with the following table structure:

CREATE TABLE interval_test
  data_from_date   DATE NOT NULL
, data_to_date     DATE
, business_key     NUMBER
  PARTITION p_interval_test_0 VALUES LESS THAN (TO_DATE('2000-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN'))

The purpose here is to create a tuple timestamped table for storing historical and current data. The data_from_date represents the date the row is known to be valid from and the data_to_date is when the row is known to changed or been removed entirely from the source system. The current row, the one that is still available in the source system, would not have an ‘end date’ and therefore would either be NULL or a made up ‘high’ date, such as DATE 9999-12-31.

And this is where the problems start.

Winding back a little, and forgetting the high date scenario for a moment, let’s take a look at how queries use the partitions. Generally, queries would use a specific date that’s between the data_from_date and data_to_date to get a view of the data for a point in time.

FROM interval_test 
WHERE DATE '2001-01-01' BETWEEN data_from_date AND data_to_date
| Id  | Operation                | Name          | Time     | Pstart| Pstop |
|   0 | SELECT STATEMENT         |               | 00:00:01 |       |       |
|   1 |  PARTITION RANGE ITERATOR|               | 00:00:01 |    14 |1048575|
|*  2 |   TABLE ACCESS FULL      | INTERVAL_TEST | 00:00:01 |    14 |1048575|

As you can see, the partitions are being pruned where possible, starting at partition 14. Side note: because this table is INTERVAL partitioned and there is no maximum value set for our data_to_date Oracle may have to look at all the partitions after partition 14. Potentially, this could be all the way up to 1048575 which is the maximum partition number that could exist. It’s a shame the optimiser isn’t a little more intelligent here – I know there aren’t 1048575 partitions and so should the optimiser!

SELECT COUNT(*) FROM dba_tab_partitions WHERE table_name = 'INTERVAL_TEST';


Anyway, back to the matter at hand. Inserting a high date into this table results in an error (described further here).

ORA-01841: (full) year must be between -4713 and +9999, and not be 0

A potential solution to the problem was to add a virtual column to the table that translates the high date into something that would go into a monthly partition. There are a bunch of options, to name a few:

  • TRUNC the date to ‘MM’ and minus 1
  • Minus 31 away from the date
  • TRUNC the date to ‘IQ’ – this would give quarterly partitions and not monthly partitions, but let’s not rule it out.
  • TO_CHAR the date to a YYYYMM, then CAST that as a NUMBER and use numbered partitioning.

I’m not going to explore all of these options and the set up script for you to test further scenarios is here. Note the test SQL will be:

SELECT * FROM interval_test WHERE DATE '2001-01-01' BETWEEN data_from_date AND data_to_date

Truncate the month and minus 1

This was my first test as this would give me the previous months value to partition on which would nicely fit in without causing the ORA-01841 error. The virtual column expression is as follows:

, partition_column AS (TRUNC(data_to_date, 'MM') - 1)

The result wasn’t a success with a query plan of:

| Id  | Operation           | Name          | Time     | Pstart| Pstop |
|   0 | SELECT STATEMENT    |               | 00:00:01 |       |       |
|   1 |  PARTITION RANGE ALL|               | 00:00:01 |     1 |1048575|
|*  2 |   TABLE ACCESS FULL | INTERVAL_TEST | 00:00:01 |     1 |1048575|

Minus 31

Next I simplified the expression by simply minusing 31 days from the date being inserted. This removes the TRUNC ‘complication’ to produce a less complex expression.

, partition_column AS (data_to_date -31)

Again, the result was a failure with a query plan of:

| Id  | Operation           | Name          | Time     | Pstart| Pstop |
|   0 | SELECT STATEMENT    |               | 00:00:01 |       |       |
|   1 |  PARTITION RANGE ALL|               | 00:00:01 |     1 |1048575|
|*  2 |   TABLE ACCESS FULL | INTERVAL_TEST | 00:00:01 |     1 |1048575|

Does it ever work?

Yes! The following expressions work, but as you’ll notice have mixed results in solving our initial problem. It seems as soon as any math operator is specified that the optimiser cannot work out that partition pruning can occur. Using TRUNC however, which to me seems a more complex function, works perfectly. Interestingly when it does work, the filter section of the query plan shows an additional predicate that your query doesn’t specify but that is used to partition prune.

filter("DATA_FROM_DATE"<=TO_DATE(' 2001-01-01 00:00:00',
'syyyy-mm-dd hh24:mi:ss') AND "DATA_TO_DATE">=TO_DATE(' 2001-01-01
00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
00:00:00', 'syyyy-mm-dd hh24:mi:ss'),'IQ'))

Here are some of the virtual columns that do partition prune however, might not solve the original issue.

, partition_column AS (TRUNC(data_to_date, 'IQ'))

| Id  | Operation                | Name          | Time     | Pstart| Pstop |
|   0 | SELECT STATEMENT         |               | 00:00:01 |       |       |
|   1 |  PARTITION RANGE ITERATOR|               | 00:00:01 |   KEY |1048575|
|*  2 |   TABLE ACCESS FULL      | INTERVAL_TEST | 00:00:01 |   KEY |1048575|
, partition_column AS (TRUNC(data_to_date)) 

| Id  | Operation                | Name          | Time     | Pstart| Pstop |
|   0 | SELECT STATEMENT         |               | 00:00:01 |       |       |
|   1 |  PARTITION RANGE ITERATOR|               | 00:00:01 |   KEY |1048575|
|*  2 |   TABLE ACCESS FULL      | INTERVAL_TEST | 00:00:01 |   KEY |1048575|


Oracle Database INTERVAL Partitioning Date Error With 31-12-9999

Get Social!

oracle-logoYou may have seen the below error when using dates at the extremes with an Oracle Database. This post focuses on the scenario where you’re inserting a high date, such as 31-12-9999 into an INTERVAL partitioned table. High dates such as this are often used in warehouse environments where from and to dates are specified to denote a period of time the data was valid for – a high date specifying the current data row.

Take the following table and partition definition:

CREATE TABLE interval_date_test
  my_date   DATE NOT NULL
  PARTITION p_interval_date_test0 VALUES LESS THAN (TO_DATE('2016-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN'))

Each row inserted will go into a partition for that year-month combination. This can be useful for high volume data that’s generally queried for one-or-so months at a time.

If, however, you try to insert the following row, you’ll get an error:

INSERT INTO interval_date_test (my_date) VALUES(DATE '9999-12-31');
INSERT INTO interval_date_test (my_date) VALUES(DATE '9999-12-31')
ERROR at line 1:
ORA-01841: (full) year must be between -4713 and +9999, and not be 0

This is due to the way the Oracle Database defines a partition. An INTERVAL RANGE partition in an Oracle Database is defined as accepting dates LESS THAN a specific value. The below example would store any dates up to the clock ticking over into May 2016.


When partitioned monthly, the LESS THAN value for inserting DATE ‘9999-12-31’ would have to be DATE ‘10000-01-01’ and that is not a valid date value. Oracle Database dates must be between the year -4713 and +9999 inclusive as explained by the error.

The solution? Either use a non-INTERVAL partition strategy, reduce the date being INSERTed or experiment with a Virtual Column and partition on that.

Cassandra Firewall Ports

Category : Knowledge

Get Social!

datastax-logoCassandra uses various ports for communication, either server to server communication or client to server communication.

If you’re using Cassandra in a firewalled environment you may need to open various ports depending on your needs. All ports are TCP and can be configured in the relevant config files to use non-default ports.

Port # Type Description
7000 TCP Non-encrypted inter-node cluster communication. Not used if SSL is in use.
7001 TCP Encrypted SSL inter-node cluster communication. Not used if SSL is not in use.
7199 TCP JMX monitoring port.
9042 TCP Client port used for native CQL.
9160 TCP Client port used for Thrift.

Where to Run Ceph Processes For High Availability

Tags :

Category : Knowledge

Get Social!

ceph-logoIn this blog post we’re going to take a detailed look at the Ceph server processes within a Ceph cluster of multiple server nodes.

Ceph is a highly available network storage layer that uses multiple disks, over multiple nodes, to provide a single storage platform for use over a network. Ceph is designed to be fault tolerant to ensure access to data is always available.

A Ceph cluster consists of 4 components:

  1. Monitor – is the daemon that holds the cluster map listing all the available nodes and Ceph daemons that are available for use by the cluster.
  2. OSD – is the daemon that handles the reading and writing of data to a physical disk. One OSD process will run per storage medium (usually a single disk device such as /dev/sdb) attached to the cluster. For example, a single server with 3 hard disks attached for use by the Ceph cluster will run 3 OSD processes.
  3. Metadata – is only required is Ceph will be used as a CephFS, that is a network filesystem. The Metadata, or MDS, daemon contains metadata on files and directories such as size and directory hierarchy so that Ceph can be used as a POSIX compliant filesystem. There can only be one active MDS daemon at any one time, but at least one secondary passive MDS daemon is advised to provide fault tolerance.
  4. Client – is used to read and write to the Ceph cluster. It allows the Linux kernel to mount a Ceph cluster as a filesystem, use Ceph as an object store or use Ceph as a block device.

Ceph Process Architecture

The Ceph storage daemons can run on dedicated hardware to provide a highly available file storage layer to remote clients over a network. This configuration is similar to a typical file server where the hardware is dedicated to only serving data. Extremely large or high performance storage requirements would usually be configured in this way to ensure there is sufficient resource (such as CPU, RAM and disk) to meet the requirements of the cluster. Ceph can also be used, in smaller and less demanding environments, to run alongside existing applications to maximise the use of existing hardware and reduce costs. This is one of the recommended setups for Proxmox VE, for example. The below example shows a 4 server cluster of Ceph daemons, suitable for either of the above two scenarios.

Ceph Cluster Process Overview

Server1 has 4 OSD processes running, which would indicate there are 4 physical disks used by Ceph on the server. The monitor daemon ensures that available nodes and daemons are tracked so that requests for I/ O can be served by active nodes. Finally, the optional MDS daemon is available and running for metadata requests from a Ceph filesystem. There can only be one active MDS daemon for a cluster, therefore server one hosts the primary MDS daemon.

Server2 is the same as Server1 however it’s MDS process is in standby mode. As only one MDS daemon can be answering requests at any given time, the MDS daemon runs in Active/ Passive mode. If the Active daemon fails, it takes around 30 seconds for the Passive daemon to become Active. This can present a small period of downtime for metadata requests. Future versions of Ceph will change this behaviour to create a more redundant process.

Server3 does not have any MDS component because there are already two nodes hosting an active and standby MDS process.

Server4 only hosts OSD processes. All other processes are already highly available and with a node of this size there is plenty of redundancy.

Scaling Ceph


Ceph is very modular by design, with each process having a specific task and talking to other processes over the network. Adding new servers is something that can be done easily, and without downtime to the existing storage pool. Further storage servers would look like either Server3 or Server4.

Generally additional servers are added so that more disks can be added to the cluster, and therefore it’s easy to understand why additional OSD processes would be added. In addition to adding additional servers (horizontal scaling), if existing servers have sufficient resource, further OSD processes could be added to the existing servers (vertical scaling). Adding additional OSD processes could be to either increase the overall storage pool size, increase the replication factor or to increase available bandwidth.

Mon processes, however, are not required on all servers and are only needed in multiples to prevent a failure causing downtime. Ceph requires a majority of monitors running to establish a quorum therefore monitors should always exist in odd numbers. To explain a little further; more monitor processes must be available then failed, for a cluster. So in a cluster with 10 monitor processes only 4 could fail before causing a problem, whereas with 11 monitor processes 5 could fail. Generally 3 mon processes will be sufficient for common Ceph cluster requirements.


Ceph should be provisioned to provide at least the minimum availability you need from your storage pool and the maximum risk you can afford to data loss. Each component must be balanced; it’s no good having a fully fault tolerant suite of monitor processes if you’ve not enabled data replication.

My final point is to plan for cluster maintenance as well as failures. Throughout a cluster’s lifetime, it’s reasonable to expect that individual servers will be taken out of service for short periods of time. For example, applying Ceph updates requires that the components are restarted, or applying kernel updates often results in a complete restart of the host taking several minutes to complete. During this time your cluster will be in a degraded state as services required by the cluster are offline. Ceph is designed to work in this scenario (depending on configuration) however if, for example, you only have 3 mon processes and you take the server down running one of them to apply a kernel update then the cluster cannot function if another mon process fails due to other reasons.

qcow2 Disk Images and Performance

Get Social!

qcow2 is a virtual disk image format developed by the guys who created QEMU and is one of the most versatile virtual disk formats available. It’s the default and preferred virtual disk format for the Proxmox VE hypervisor and should be used for most virtual machines.

qcow2 offers the following features :

  • Sparse space allocation which means that the entire virtual disk size doesn’t need to be allocated on the hard drive when it’s created. Only the physical space needed by actual data stored to the virtual disk is required.
  • Snapshots can be stored and rolled back to thanks to the copy-on-write process which is used to write to qcow2 files.
  • Linked or chained files can be used. For example, a read only base file could be used to hold ‘system’ files (a gold plate image, if you will), and any changes could be written to an additional file leaving the original intact and unchanged. Multiple machines could use this base file at once, therefore reducing space requirements.
  • AES encryption can be used to encrypt all data at rest.
  • Compression, based on zlib, to reduce physical space requirements and reduce read bytes.

Because of all these features, qcow2 files have a processing overhead, when compared to raw files, in that any data read or written to a qcow2 virtual disk would have to go through a process that could slow the read or write operations. This means there is an overhead associated with IO operations on qcow2 files, again, compared to raw type storage that we have to consider when deciding which features to use.

Increase qcow2 Performance

Sparse Space Allocation

Anything stored on a virtual disk has to be, at some point, stored on a physical medium such as a hard disk. In addition to the data, a virtual disk has a small amount of metadata associated with it that is usually stored in the same file. For example, a virtual disk has no physical constraint on how large it can be, like a hard disk, and therefore this is one of the bits of data we need to store in the qcow2 file.

In addition to that, and just like a physical hard drive, data in a qcow2 file is stored in blocks or clusters and a lookup is required to determine what data is in which cluster. Think of this as a shelf full of numbered boxes, and having a book (or index) which tells you what each box number contains. All of this cluster information is also stored within the qcow2 file consuming disk space that is relative to the data capacity of the qcow2 file. For example, a qcow2 file that can store 1GB of data would have a much smaller metadata footprint than a qcow2 file that can store 100GB of data.


Anyway, back to sparse files. The idea of a sparse file is to remove the need to allocate the full size of the file to a physical disk. I can, for example, create a qcow2 image with a data capacity of 10GB that will take up just several KBs of physical space until data is saved to the qcow2 image. As data is saved to the qcow2 image, the physical space used by the image will increase (the data has to be stored somewhere, right?). In addition, as will the metadata because each new cluster that’s required by the qcow2 file will have it’s own entry in the metadata section of the file.

qemu-img comes with various options for setting the allocation when creating new disk images.

  • preallocation=metadata – allocates the space required by the metadata but doesn’t allocate any space for the data. This is the quickest to provision but the slowest for guest writes.
  • preallocation=falloc –  allocates space for the metadata and data but marks the blocks as unallocated. This will provision slower than metadata but quicker than full. Guest write performance will be much quicker than metadata and similar to full.
  • preallocation=full – allocates space for the metadata and data and will therefore consume all the physical space that you allocate (not sparse). All empty allocated space will be set as a zero. This is the slowest to provision and will give similar guest write performance to falloc.

Example command:

qemu-img create -f qcow2 -o preallocation=falloc image.qcow2 1G

The performance impact here is when the virtual image needs to grow in order to store new information written to it. For each new write a new cluster will need to be provisioned and a metadata index entry referencing the new cluster. Depending on the above option selected, the OS may have to allocate a new sector for both the index and the data cluster incurring a performance penalty. Once the disk has been expanded (e.g. or preallocation=full) then there is no penalty on assigning a new cluster as all the clusters are already assigned and available.

See qcow2 preallocation for some examples and benchmarks of the above attributes.


qcow2 images are not encrypted by default, so not using encryption couldn’t be more simple. Of course, your data will not be encrypted (unless you use some other process on top of the virtual storage layer) but you’ll save all those CPU cycles when reading and writing the data.


qcow2 is, at best, a bit weird when it comes to compression (encryption works the same way, too!) in that compression is a one time event, or process that you run to compress an existing image. Any data written after this will be stored uncompressed.

The next thing is to understand compression itself – compression (under the right circumstances) will reduce the size of the data stored on disk at the expense of CPU to compress (one off) and decompress (every time the data is accessed) the data. In certain circumstances, compression can result in a quicker read for the process consuming the data, such as where CPU is abundant and IO bandwidth is very small.

As always, testing your scenarios is the best way to understand the impact.

My Proxmox VE Bash Scripts on GitHub

Category : Knowledge

Get Social!

octocat-githubI’ve created a public repository on GitHub with few Bash scripts which may (or may not) be helpful for anyone administering a Proxmox VE Server.

I’ll add to this repository over time with any new scripts I write. Let me know if they are useful, or if you have any of your own in the comments.

Currently the scripts are:

  • backup-all – will backup all OpenVZ Containers on the Proxmox cluster.
  • ct-disk-space – a print out of the diskspace used by all OpenVZ Containers on the server.
  • migrate – will migrate a Container to a different storage ID.
  • restore-all -will restore all OpenVZ Containers on a cluster to the latest available backup.

Link to GitHub proxmox-scripts repository:

Visit our advertisers

Quick Poll

Do you use GlusterFS in your workplace?

Visit our advertisers