
Proxmox OpenVZ SWAP and Performance


I have been having trouble with a Proxmox node which only runs OpenVZ containers but is at the upper limit of its RAM resources. Over time, I noticed that Proxmox used SWAP (virtual memory, page file, etc.) quite aggressively to make sure there was always some RAM free. That sounds fantastic, and is just what I would expect the Proxmox server to be doing, except it does it all too well. Proxmox made sure that around 40% of the RAM on the host machine was free, at the expense of moving many running processes, across all the running containers, to SWAP.

This is how Linux works, by design, and is expected behaviour. Running processes with memory which hasn’t been touched in a while have that memory moved to SWAP. This allows other applications which need the memory right now to use it, and anything left over can be used as cache by the kernel. When a process with memory in SWAP needs to use that memory, it has to be read from SWAP back into RAM before it can be used. This process carries a huge overhead and is often noticeable when you use a container which has not been used in a while – at first everything will be slow, until all the required memory has been read from SWAP and put back into RAM.

To help with this situation we can do two things:

  • Make sure SWAP is always on a fast disk with plenty of free IO bandwidth. On a small installation, this should not be on the same disk as your container file systems. SSDs can also bring a huge performance benefit over conventional mechanical drives.
  • Reduce the amount of RAM which Proxmox keeps free by making the algorithm which moves memory to SWAP less aggressive.
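Before changing anything, it’s worth checking how much SWAP is currently in use and what the current swappiness value is. The below commands use standard tools available on any Proxmox or Debian based system:

# Show RAM and SWAP usage in megabytes
free -m

# List the active SWAP devices and how much of each is in use
swapon -s

# Show the current swappiness value (usually 60 by default)
cat /proc/sys/vm/swappiness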

Move SWAP to fast storage

When installing Proxmox for the first time, a SWAP partition will generally be created on your hard disk. By default, this will be on the same disk as your Proxmox operating system and your container storage. On a slow mechanical disk, this results in far too much IO concurrency – that is, different processes trying to read or write to the disk at the same time – which will massively affect server performance. One thing we can move to another disk is the system wide SWAP.

You can use a new file, disk, partition or block device for your new SWAP location. You will then need to turn your old SWAP device off to stop it from being used. Use the below examples to move your SWAP device.

See this post for a quick script to automatically create a SWAP file.

Make a new SWAP device as a file

Create a file on your file system and enable it to be used as a SWAP device. The below example uses the mount point /mnt/swapdrive and the file swapfile as the new SWAP device, with a size of 4096 MB.

dd if=/dev/zero of=/mnt/swapdrive/swapfile bs=1M count=4096

You will then need to format the file as SWAP with the below command.

mkswap /mnt/swapdrive/swapfile
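Recent versions of mkswap and swapon will warn if the SWAP file is readable by other users, so it’s sensible to restrict its permissions before enabling it:

# SWAP files should only be readable and writable by root
chmod 600 /mnt/swapdrive/swapfile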

Make a new SWAP device as a partition

Use the below commands to set up a drive partition as your new SWAP device. The example uses /dev/sdc3 as the SWAP partition. You must have created this partition in advance for it to be available.

mkswap /dev/sdc3
swapon /dev/sdc3

Turn a new SWAP device on

Once you have created a new SWAP device, either a file, disk or partition, you will need to enable it using the swapon command. The below shows an example for both a file and a disk partition:

swapon /mnt/swapdrive/swapfile
swapon /dev/sdc3

Turn off the old SWAP device

To turn off the old SWAP device, first identify it using swapon -s.

swapon -s

Then, use the swapoff command to turn the device off. The below example is the default Proxmox SWAP device location.

swapoff /dev/mapper/pve-swap

Clear SWAP space without rebooting

You can clear your SWAP memory by turning the system wide SWAP off and then back on again. Run the first of the below commands to turn off your system wide SWAP, forcing everything in SWAP to be read back into RAM. You must have enough free RAM available on your system for this to work correctly. Once it has completed, run the second command to turn SWAP back on again. You can also use this to make your SWAP changes take effect.

swapoff -a 
swapon -a

Make the SWAP file persist after rebooting

To make sure your SWAP file is mounted the next time your machine reboots you’ll need to add an entry to the fstab file.

Open the fstab file with your text editor:

vi /etc/fstab

And add a line similar to the below, making sure the first attribute is the location of your newly created SWAP file.

/mnt/swapdrive/swapfile  swap  swap  defaults  0  0

Change the ‘swappiness’ setting

To change how aggressively Proxmox, or any other Linux distribution, moves process memory to SWAP, we have the swappiness attribute. Swappiness is a kernel setting which can be set permanently in the /etc/sysctl.conf file, or temporarily using sysctl.

The swappiness setting takes a value between 0 and 100. A value of 0 will virtually turn off SWAP, using it only to avoid an out-of-memory (OOM) condition. A value of 100 will cause the system to use SWAP as often as possible and will likely degrade system performance severely. A value of 60 is the default for Proxmox.

Change the swappiness value for the current boot

To change your swappiness value for the current boot, use the below command. The value will be reset after rebooting. The following example sets the swappiness value to 20.

sysctl -w vm.swappiness=20

Permanently change the swappiness value

To permanently change your swappiness value, open /etc/sysctl.conf in a text editor. Note that this alone will not affect the current boot.

vi /etc/sysctl.conf

And add the following line to give a swappiness value of 20:

vm.swappiness=20
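To apply the value from /etc/sysctl.conf immediately, without waiting for a reboot, reload the file and check the result:

# Reload /etc/sysctl.conf and apply its settings
sysctl -p

# Confirm the new value is active
cat /proc/sys/vm/swappiness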

My experience with GlusterFS performance.



I have been using GlusterFS to replicate storage between two physical servers for two reasons: load balancing and data redundancy. I use this on top of a ZFS storage array, as described in this post, and the two technologies combined provide a fast and very redundant storage mechanism. At the ZFS layer, or whichever file system technology you use, there are several functions that we can leverage to provide fast performance. For ZFS specifically, we can add SSD disks for caching and tweak memory settings to provide the most throughput possible on any given system.

With GlusterFS we also have several ways to improve performance, but before we look into those, we need to be sure that it is the GlusterFS layer which is causing the problem. For example, if your disks or network are slow, what chance does GlusterFS have of giving you good performance? You also need to understand how the individual components behave under the load of your expected environment. The disks may work perfectly well when you use dd to create a huge file, but what about when lots of users create lots of files all at the same time?

You can break down performance into three key areas:

  • Networking – the network between each GlusterFS instance.
  • Filesystem IO performance – the file system local to each GlusterFS instance.
  • GlusterFS – the actual GlusterFS process.

Networking Performance

Before testing the disk and file system, it’s a good idea to make sure that the network connection between the GlusterFS nodes is performing as you would expect. Test the network bandwidth between all GlusterFS boxes using Iperf. See the Iperf blog post for more information on benchmarking network performance. Remember to test the performance over a period of several hours to minimise the effect of host and network load. If you make any network changes, remember to test between each change to make sure it has had the desired effect.

Filesystem IO Performance

Once you have tested the network between all GlusterFS boxes, you should test the local disk speed on each machine. There are several ways to do this, but I find it’s best to keep it simple and use one of two options: dd or bonnie++. You must be sure to turn off any GlusterFS replication, as it is just the disks and file system which we are trying to test here. Bonnie++ is a freely available IO benchmarking tool. dd is a Linux command line tool which can replicate data streams and copy files. See this blog post for information on benchmarking the file system.

Technology, Tuning and GlusterFS

Once we are certain that disk IO and network bandwidth are not the issue, or, more importantly, understand what constraints they place on the environment, we can tune everything else to maximise performance. In our case, we are trying to maximise GlusterFS replication performance over two nodes.

We can aim to achieve replication speeds nearing the speed of the slowest performing component: file system IO or network.

See my blog post on GlusterFS performance tuning.


GlusterFS performance tuning



I have been using GlusterFS to provide file synchronisation over two networked servers. As soon as the first file was replicated between the two nodes, I wanted to understand the time it took for the file to be available on the second node. I’ll call this replication latency.

As discussed in my other blog posts, it is important to understand what the limitations are in the system without the GlusterFS layer. File system and network speed need to be understood so that we are not blaming high replication latency on GlusterFS when it’s slow because of other factors.

The next thing to note is that replication latency is affected by the type of file you are transferring between nodes. Many small files will result in lower transfer speeds, whereas very large files will reach the highest speeds. This is because there is a large per-file overhead with GlusterFS replication, meaning the larger the file, the more that overhead is amortised across the transfer.

As with all performance tuning, there are no magic values which work on all systems. The defaults in GlusterFS are configured at install time to provide the best performance over mixed workloads. To squeeze performance out of GlusterFS, use an understanding of the below parameters and how they may be used in your setup.

After making a change, be sure to restart all GlusterFS processes and begin benchmarking the new values.

GlusterFS specific

GlusterFS volumes can be configured with multiple settings. These can be set on a volume using the below command, substituting [VOLUME] for the volume to alter, [OPTION] for the parameter name and [PARAMETER] for the parameter value.

gluster volume set [VOLUME] [OPTION] [PARAMETER]

Example:

gluster volume set myvolume performance.cache-size 1GB

Or you can add the parameter to the glusterfs.vol config file.

vi /etc/glusterfs/glusterfs.vol

Some of the more useful parameters are:
  • performance.write-behind-window-size – the size in bytes to use for the per file write behind buffer. Default: 1MB.
  • performance.cache-refresh-timeout – the time in seconds a cached data file will be kept until data revalidation occurs. Default: 1 second.
  • performance.cache-size – the size in bytes to use for the read cache. Default: 32MB.
  • cluster.stripe-block-size – the size in bytes of the unit that will be read from or written to on the GlusterFS volume. Smaller values are better for smaller files and larger sizes for larger files. Default: 128KB.
  • performance.io-thread-count – is the maximum number of threads used for IO. Higher numbers improve concurrent IO operations, providing your disks can keep up. Default: 16.
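As a quick sketch, assuming a volume named myvolume, you could increase the IO thread count and read cache and then confirm the options have been applied:

# Allow more concurrent IO operations (if your disks can keep up)
gluster volume set myvolume performance.io-thread-count 32

# Increase the read cache from the 32MB default to 256MB
gluster volume set myvolume performance.cache-size 256MB

# Show the volume details, including any reconfigured options
gluster volume info myvolume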

Other Notes

When mounting your storage for the GlusterFS layer, make sure it is configured for the type of workload you have.

  • When mounting your GlusterFS storage from a remote server to your local server, be sure to disable direct-io as this will enable the kernel read ahead and file system cache. This is sensible for most workloads where caching of files is beneficial.
  • When mounting the GlusterFS volume over NFS, use noatime and nodiratime to avoid updating access timestamps over NFS.
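For example, a hypothetical NFS mount of a volume called myvolume from a server called gluster1, with access time updates disabled, might look like the below. Depending on your setup you may also need to force NFS version 3, which is what the GlusterFS NFS server speaks:

# Mount the GlusterFS volume over NFSv3 without access time updates
mount -t nfs -o vers=3,noatime,nodiratime gluster1:/myvolume /mnt/gluster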

I haven’t been working with GlusterFS for long so I would be very interested in your thoughts on performance. Please leave a comment below.


Remove the Proxmox “No Valid Subscription” message



Proxmox 3.1 has implemented a new repository setup, as described in my recent blog post.

Each time you log into Proxmox 3.1 a dialogue box pops up with the message:

You do not have a valid subscription for this server. Please visit www.proxmox.com to get a list of available options.

One way to remove the message is to purchase a subscription from the Proxmox team. Remember that paid subscriptions keep the development of Proxmox progressing. For the recent release, the subscription cost has been heavily reduced and is more affordable than ever.

The fact of the matter is, I started using Proxmox as a free and open source tool and expected it to stay that way. Had I known a subscription element would be introduced, I would likely have chosen another toolset. As it is, I am too invested in Proxmox (time-wise) and changing to another technology is simply out of the question at this point.

This brings me on to the other method: make a slight change to the code to stop the dialogue box from appearing. This is allowed under the licence (AGPLv3) used for Proxmox, however future updates may overwrite your change and you may have to re-apply it or apply a different change.

You will need SSH access to the Proxmox host and sufficient permissions to edit the pvemanagerlib.js file.

First, take a backup of the file:

cp /usr/share/pve-manager/ext4/pvemanagerlib.js /usr/share/pve-manager/ext4/pvemanagerlib.js_BKP

Then open the file using a text editor, vi for example.

vi /usr/share/pve-manager/ext4/pvemanagerlib.js

Currently on line 519 of the file, although this may change with future updates, there is a line similar to the below:

if (data.status !== 'Active') {

This line checks whether your subscription status is not ‘Active’. It needs to be changed to always evaluate to false to stop the subscription message from being shown.

if (false) {

And that annoying little popup will be a thing of the past!

Note: You may need to clear your web browser cache after applying this code change.
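If you would rather apply the change non-interactively, a sed one-liner can make the same substitution. This is only a sketch – the exact code may differ between versions, so check the file afterwards. A backup copy is kept with a .orig suffix:

# Replace the subscription status check with false
sed -i.orig "s/data.status !== 'Active'/false/" /usr/share/pve-manager/ext4/pvemanagerlib.js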

I have added this code to my Proxmox patch – see this blog post for more information.


Benchmark disk IO with DD and Bonnie++


Benchmarking disk or file system IO performance can be tricky at best. The problem is that modern file systems leverage various techniques to ensure the best possible performance, such as caching files in RAM. This means that unless you circumvent the disk cache, your reported speeds will reflect how quickly files can be read from memory, not from disk.

In this example, I’ll cover benchmarking a Linux file system using two methods: dd for the easy route, and bonnie++ for a more comprehensive test.

dd

Write

You can use dd to create a large file as quickly as possible to see how long it takes. It’s a very basic test and not very customisable, however it will give you a sense of the performance of the file system. You must make sure this file is larger than the amount of RAM you have on your system to avoid the whole file being cached in memory.

dd is usually installed out-of-the-box on most Linux distributions, which makes it an ideal tool in locked-down environments or environments where it’s tricky to get packages installed. Use the below command, substituting [PATH] with the file system path to test, [BLOCK_SIZE] with the block size in KB and [LOOPS] with the number of blocks to write.

time sh -c "dd if=/dev/zero of=[PATH] bs=[BLOCK_SIZE]k count=[LOOPS] && sync"

A breakdown of the command is as follows:

  • time – times the overall process from start to finish
  • of= this is the path and file name which you would like to test. The path must be readable and writable.
  • bs= is the block size to use. If you have a specific load which you are testing for, make this value mirror the write size which you would expect.
  • sync – forces the process to write the entire file to disk before completing. Note that dd will return before all the data is flushed to disk, but the sync will not, therefore the time output includes the sync to disk.

The below example uses a 4K block size and loops 2000000 times. The resulting write size will be around 7.6GB.

time sh -c "dd if=/dev/zero of=/mnt/mount1/test.tmp bs=4k count=2000000 && sync"
2000000+0 records in
2000000+0 records out
8192000000 bytes transferred in 159.062003 secs (51501929 bytes/sec)
real 2m41.618s
user 0m0.630s
sys 0m14.998s

Now, let’s do the maths. dd tells us how many bytes were written, and the time command tells us how long it took – use the real value at the bottom of the output. The formula is BYTES / SECONDS. For these larger tests, convert bytes to KB or MB to get more sensible numbers.

(8192000000 / 1024 / 1024) / ((2 * 60) + 41.618)

Bytes converted to MB / (2 minutes + 41.618 seconds)

This gives us an average of 48.34 megabytes per second over the duration of the test.
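If you don’t want to do the arithmetic by hand, the shell can do it for you, for example with bc:

# 8192000000 bytes to MB, divided by 161.618 seconds, gives ~48.34 MB/s
echo "scale=4; (8192000000 / 1024 / 1024) / 161.618" | bc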

Read

We can also use dd to test the read speed of a disk by reading back the file we created and timing the process. Before we do that, we need to flush the file cache by writing another file which is about the size of the RAM installed on the test system. If we don’t do this, the file we just created will be partially held in RAM and the read test would not be reading entirely from disk.

Create a file using dd which is about the same size as the RAM installed on the system. The below assumes 2GB of RAM is installed. You can check how much RAM is installed with free.

dd if=/dev/zero of=/mnt/mount1/clearcache.tmp bs=4k count=524288
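Alternatively, on modern kernels you can drop the page cache directly instead of writing a dummy file. This needs to be run as root, and you should sync first so that any dirty pages are written out:

# Flush dirty pages to disk, then drop the page cache, dentries and inodes
sync
echo 3 > /proc/sys/vm/drop_caches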

Now for the read test of our original file.

time sh -c "dd if=/mnt/mount1/test.tmp of=/dev/null bs=4k"

And process the time result the same way as when writing.

Bonnie++

Bonnie++ is a small utility with the purpose of benchmarking file system IO performance. It’s commonly available in Linux repositories or available from source from the home page.

On Debian/Ubuntu based systems, use the apt-get command.

apt-get install bonnie++

Just like with DD, we need to minimise the effect of file caching and therefore the tests should be performed on datasets larger than the amount of RAM you have on the test system. Some people suggest that you should use datasets up to 20 times the amount of RAM, others suggest twice the amount of RAM. Whichever you use, always use the same dataset size for all tests performed to ensure the results are comparable.

There are many options which can be used with bonnie++, too many to cover here, so let’s look at some of the common ones.

  • -d – is used to specify the file system directory to use to benchmark.
  • -u – is used to run as a particular user. This is best used if you run the program as root. This is the UID or the name.
  • -g – is used to run as a particular group. This is the GID or the name.
  • -r – is used to specify the amount of RAM in MB the system has installed. This is total RAM, and not free RAM. Use free -m to find out how much RAM is on your system.
  • -b – removes write buffering and performs a sync at the end of each bonnie++ operation.
  • -s – specifies the dataset size to use for the IO test in MB.
  • -n – is the number of files to use for the create files test.
  • -m – this adds a label to the output so that you can understand what the test was at a later date.
  • -x – is used to repeat the tests n times. Change n to the number of times to run the tests.
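As an example of combining several of these options, the below is a hypothetical run of three passes against /mnt/mount1 on a machine with 2GB of RAM, using a 4GB dataset and labelled for later reference:

# Three runs, 4GB dataset, labelled 'mount1-test', run as user james
bonnie++ -d /mnt/mount1 -r 2048 -s 4096 -n 16 -m mount1-test -x 3 -u james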

bonnie++ performs multiple tests, depending on the arguments used, and does not display much until the tests are complete. When the tests complete, two outputs are visible. The bottom line is not readable (unless you really know what you are doing) however above that is a table based output of the results of the tests performed.

Let’s start with a basic test, telling bonnie++ where to test and how much RAM is installed, 2GB in this example. bonnie++ will then use a dataset twice the size of the RAM for tests. As I am running as root, I am specifying a user name.

bonnie++ -d /tmp -r 2048 -u james

bonnie++ will take a few minutes, depending on the speed of your disks and return with something similar to the output below.

Using uid:1000, gid:1000.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version 1.96 ------Sequential Output------ --Sequential Input- --Random-
Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
ubuntu 4G 786 99 17094 3 15431 3 4662 91 37881 4 548.4 17
Latency 16569us 15704ms 2485ms 51815us 491ms 261ms
Version 1.96 ------Sequential Create------ --------Random Create--------
ubuntu -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
 files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
 16 142 0 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency 291us 400us 710us 382us 42us 787us
1.96,1.96,ubuntu,1,1378913658,4G,,786,99,17094,3,15431,3,4662,91,37881,4,548.4,17,16,,,,,142,0,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,16569us,15704ms,2485ms,51815us,491ms,261ms,291us,400us,710us,382us,42us,787us

The output shows quite a few statistics, but it’s actually quite straightforward once you understand the format. First, discard the bottom line (or three lines in the above output) as this is the results separated by commas. Some scripts and graphing applications understand these results, but it’s not so easy for humans. The top few lines are just the tests which bonnie++ performs and, again, can be discarded.

Of course, all the output of bonnie++ is useful in some context, however we are just going to concentrate on random read/ write, reading a block and writing a block. This boils down to this section:

Version 1.96 ------Sequential Output------ --Sequential Input- --Random-
Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
ubuntu 4G 786 99 17094 3 15431 3 4662 91 37881 4 548.4 17
Latency 16569us 15704ms 2485ms 51815us 491ms 261ms

The above output is not the easiest output to understand due to the character spacing but you should be able to follow it, just. The below points are what we are interested in, for this example, and should give you a basic understanding of what to look for and why.

  • ubuntu is the machine name. If you specified -m some_test_info this would change to some_test_info.
  • 4GB is the total size of the dataset. As we didn’t specify -s, a default of RAM x 2 is used.
  • 17094 shows the speed in KB/s at which the dataset was written. This, and the next two points, are all sequential operations – that is, operating on more than one data block at a time.
  • 15431 is the speed at which a file is read and then written and flushed to the disk.
  • 37881 is the speed the dataset is read.
  • 548.4 shows the number of blocks which bonnie++ can seek to per second.
  • The Latency numbers correspond with the above operations – this is the full round-trip time it takes for bonnie++ to perform the operations.

Anything showing multiple +++ means the test could not be run with reasonable confidence in the results because it completed too quickly. Increase -n to use more files in the operation and see the results.

bonnie++ can do much more and, even out of the box, show much more, but this will give you some basic figures to understand and compare. Remember: always perform tests on datasets larger than the RAM you have installed, multiple times over the day, to reduce the chance of other processes interfering with the results.


Testing network speed with Iperf



Iperf is an open source network bandwidth testing application, available on Linux, Windows and Unix. Iperf can be used in two modes: client and server. The server runs on the remote host and listens for connections from the client. The client is where you set the bandwidth test parameters and connect to the remote server.

You can install Iperf using apt-get on Ubuntu.

apt-get install iperf

Once installed, run Iperf in server mode on the remote host. If you wish to run the server in daemon mode, add -D to the command.

iperf -s

Iperf has many configurable options for testing network throughput. For our test, we will use TCP connections to a remote server at IP 10.1.1.50. The test will use 4 threads, each sending data, and will be performed in both directions.

iperf -c 10.1.1.50 -r -P 4

The result will look similar to the below output. This example shows a total throughput of 9.42 Gbits/second in one direction and 11.9 Gbits/second in the other. As the results can fluctuate depending on the load of the servers, or how congested the network is between them, it’s best to run each test 3 – 4 times at different times of the day and take an average.

------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 192.168.50.112, TCP port 5001
TCP window size: 22.9 KByte (default)
------------------------------------------------------------
[ 8] local 10.1.1.51 port 39372 connected with 10.1.1.50 port 5001
[ 6] local 10.1.1.51 port 39370 connected with 10.1.1.50 port 5001
[ 3] local 10.1.1.51 port 39369 connected with 10.1.1.50 port 5001
[ 7] local 10.1.1.51  port 39371 connected with 10.1.1.50 port 5001
[ ID] Interval Transfer Bandwidth
[ 8] 0.0-10.0 sec 2.77 GBytes 2.38 Gbits/sec
[ 3] 0.0-10.0 sec 2.92 GBytes 2.51 Gbits/sec
[ 6] 0.0-10.0 sec 3.30 GBytes 2.83 Gbits/sec
[ 7] 0.0-10.0 sec 1.99 GBytes 1.71 Gbits/sec
[SUM] 0.0-10.0 sec 11.0 GBytes 9.42 Gbits/sec
[ 5] local 10.1.1.50 port 5001 connected with 10.1.1.51 port 49434
[ 3] local 10.1.1.50 port 5001 connected with 10.1.1.51 port 49435
[ 9] local 10.1.1.50 port 5001 connected with 10.1.1.51 port 49437
[ 8] local 10.1.1.50 port 5001 connected with 10.1.1.51 port 49436
[ 5] 0.0-10.0 sec 3.32 GBytes 2.85 Gbits/sec
[ 9] 0.0-10.0 sec 4.18 GBytes 3.58 Gbits/sec
[ 8] 0.0-10.0 sec 3.28 GBytes 2.81 Gbits/sec
[ 3] 0.0-10.0 sec 3.11 GBytes 2.66 Gbits/sec
[SUM] 0.0-10.0 sec 13.9 GBytes 11.9 Gbits/sec
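Once you have a baseline, longer runs give more stable figures. For example, the below hypothetical test runs for 60 seconds and prints interim results every 10 seconds:

# Run a 60 second test, reporting every 10 seconds
iperf -c 10.1.1.50 -t 60 -i 10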

For more test parameters see our Iperf cheat sheet.

