For anyone running benchmarks or a POC, or about to move into production with GigaSpaces:
Here is a set of tuning activities to carry out before moving into production, or when running your POC or benchmarks. They are listed in order of priority.
Most of them are critical for environments with a large number of spaces/clients.
See also the additional information about Solaris at the end, which you might find interesting.
Memory planning
Hardware capacity planning should be done in advance so that servers have enough free memory to avoid swapping at any point in time. Without sufficient free RAM, the OS starts swapping memory pages to disk, which decreases overall performance by a factor of 20 to 50. This must be strictly avoided.
Command to check free memory in Linux:
$ free
Used swap should be zero, and free memory should be sufficient to host all GigaSpaces and third-party components planned for deployment. With all software components loaded after a reboot, free memory should be at least 300MB. Free memory below 50MB is critically low and risks an OS or application crash.
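As a rough illustration, the thresholds above can be turned into a small check. This is only a sketch: the function name check_memory and its messages are made up for this example, and it reads MemFree, SwapTotal and SwapFree from /proc/meminfo (Linux only), which are the same counters that free reports.

```shell
# Thresholds taken from the text above (in MB).
MIN_FREE_MB=300
CRITICAL_FREE_MB=50

# check_memory FREE_MB SWAP_USED_MB  (hypothetical helper)
# Prints OK, WARN or CRITICAL according to the thresholds above.
check_memory() {
    free_mb=$1
    swap_used_mb=$2
    if [ "$free_mb" -lt "$CRITICAL_FREE_MB" ]; then
        echo "CRITICAL: only ${free_mb}MB free"
    elif [ "$swap_used_mb" -gt 0 ]; then
        echo "WARN: ${swap_used_mb}MB of swap in use"
    elif [ "$free_mb" -lt "$MIN_FREE_MB" ]; then
        echo "WARN: only ${free_mb}MB free"
    else
        echo "OK: ${free_mb}MB free, no swap in use"
    fi
}

# Read the live values (reported in kB) and convert them to MB.
if [ -r /proc/meminfo ]; then
    free_mb=$(awk '/^MemFree:/ {print int($2/1024)}' /proc/meminfo)
    swap_used_mb=$(awk '/^SwapTotal:/ {t=$2} /^SwapFree:/ {f=$2}
                        END {print int((t-f)/1024)}' /proc/meminfo)
    check_memory "${free_mb:-0}" "${swap_used_mb:-0}"
fi
```

Run it once with all components loaded after a reboot, and again under load, to see whether the machine stays within the limits.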
Ulimit and file descriptors tuning
This is critical when running multiple GSCs on the same machine or when running multithreaded clients.
Set in /etc/security/limits.conf (root permissions required):
* soft nofile 16384
* hard nofile 65536
* soft nproc 2047
* hard nproc 32768
Reboot the machine after changing this file to allow the changes to take effect.
nofile limits the number of open file descriptors, which bounds the number of connections/sockets; nproc limits the number of processes, which bounds the number of threads.
In general, each application thread opens a connection/socket to the space. Every sync replication also uses a dedicated connection/socket.
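To confirm that the new limits are actually in effect for the user that starts the GSC, something like the following can be run from that user's shell. It is a minimal sketch: check_limit is a made-up helper, and the recommended values are the soft limits listed above.

```shell
# check_limit NAME CURRENT RECOMMENDED  (hypothetical helper)
# Compares a current soft limit with a recommended minimum.
check_limit() {
    name=$1
    current=$2
    recommended=$3
    case "$current" in
        unlimited)
            echo "$name ok (unlimited)" ;;
        ''|*[!0-9]*)
            echo "$name unknown" ;;
        *)
            if [ "$current" -ge "$recommended" ]; then
                echo "$name ok ($current)"
            else
                echo "$name too low ($current < $recommended)"
            fi ;;
    esac
}

# Soft limits of the current shell; a process started from it inherits them.
# ulimit -u is not available in every shell, so a failure shows as "unknown".
check_limit nofile "$(ulimit -n 2>/dev/null)" 16384
check_limit nproc "$(ulimit -u 2>/dev/null)" 2047
```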
GSC JVM Tuning
Applications using GigaSpaces usually run on very fast CPUs, so they create temporary objects at a rate that is too high for the GC to handle well with its default settings. Careful tuning of the JVM is therefore very important. In extreme cases we would even suggest using real-time Java, which is slower than the regular JVM but much more deterministic in its GC activity.
Below is an example of the settings we recommend for applications that create a lot of temporary objects and cannot afford long pauses. These are good for cases where the business logic and the space are collocated (aka embedded space):
-Xms2g -Xmx2g -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing -XX:CMSIncrementalDutyCycleMin=10 -XX:CMSIncrementalDutyCycle=50 -XX:ParallelGCThreads=8 -XX:+UseParNewGC -Xmn150m -XX:MaxGCPauseMillis=2000 -XX:GCTimeRatio=10 -XX:+DisableExplicitGC
See more info here:
http://www.gigaspaces.com/wiki/display/OLH/Memory+Management+-+Space+Schema
Please note that -XX:+UseConcMarkSweepGC has the heaviest impact on performance: a decrease of 40%.
The following set of parameters shows 20% better performance than with -XX:+UseConcMarkSweepGC, while pauses still stay below 100 msec in an embedded test with a 10KB payload and 100 threads:
-Xms2g -Xmx2g -Xmn150m -XX:GCTimeRatio=2 -XX:ParallelGCThreads=8 -XX:+UseParNewGC -XX:MaxGCPauseMillis=2000 -XX:+DisableExplicitGC
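When comparing GC settings like the two sets above, it helps to measure the pauses rather than guess. One possible way, sketched below, is to start the JVM with -verbose:gc (optionally with -Xloggc:<file>) and extract the longest pause from the log. The helper name max_gc_pause_ms is made up, the parsing assumes the classic Sun JVM line format ending in "<seconds> secs]", and the two sample lines are invented for illustration only.

```shell
# max_gc_pause_ms  (hypothetical helper)
# Reads -verbose:gc style lines on stdin and prints the longest pause in ms.
max_gc_pause_ms() {
    awk '
        index($0, "secs]") {
            # The pause length is the field right before each "secs]" token.
            for (i = 2; i <= NF; i++)
                if (index($i, "secs]") == 1) {
                    p = $(i - 1) * 1000
                    if (p > max) max = p
                }
        }
        END { printf "%.1f\n", max }
    '
}

# Illustrative sample input, not taken from a real run.
printf '%s\n' \
    '[GC 325407K->83000K(776768K), 0.2300771 secs]' \
    '[Full GC 267628K->83769K(776768K), 1.8479984 secs]' |
    max_gc_pause_ms
```

For a real log, run max_gc_pause_ms < gc.log and compare the result with the pause budget of your application (for example, the 100 msec mentioned above).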
GSM/LUS JVM Tuning
The GSM and LUS are essential components of GigaSpaces. The LUS is the GigaSpaces directory service. The GSM is the deployment and service management service. Each of them consumes some resources and runs periodic events, essentially a heartbeat mechanism, that must complete on time for the health of the system to be identified correctly.
The correct way to deploy them is to run them as 2 separate processes on dedicated machines with plenty of memory (1G minimum), and to tune the JVM GC carefully, reducing the young generation size to minimize GC pauses. This avoids stop-the-world events that could lead to a split-brain scenario, an inability to access the spaces, and general system instability.
When there is no dedicated machine for the LUS/GSM, we recommend running them on machines with fewer GSCs so that the machine's CPU is not saturated. A saturated CPU causes pauses in LUS activity, which makes the system unstable.
On Solaris machines (see below) you can bind a process to a dedicated CPU, or even allocate it a minimum CPU percentage, to make sure it gets the resources it needs.
Multiple LUSs provide better HA, but consume more resources from the spaces and, in fact, somewhat more from the client processes. The spaces need to register themselves with every LUS, and the clients and spaces receive events from all running LUSs to update their local lookup caches. Please note that we removed the LUS-client chat in 6.5, which means the LUS does less work. This makes it more robust and reduces the number of temporary objects it creates in the JVM, which could otherwise lead to long GC activity.
Given the above, the recommendation is to run 2 LUS/GSM instances. No more.
With 6.5 we improved large portions of the LUS and the LRMI communication mechanism, which makes the LUS much more robust. This allows the LUS to support very large clusters with a very large number of clients. We have users running 6.5 with clusters of more than 100 spaces and more than 1000 clients.
6.0 supports up to 20 partitions out of the box, but its LUS JVM needs some tuning to be on the safe side. For larger clusters make sure you use 6.5 or above.
In extreme cases with gigantic systems, the LUS activity can be scaled by running multiple LUSs, each responsible for a subset of the clients/spaces.
Here is an example of settings for the GSM:
-Xms1g -Xmx1g -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing -XX:CMSIncrementalDutyCycleMin=10 -XX:CMSIncrementalDutyCycle=50 -XX:ParallelGCThreads=8 -XX:+UseParNewGC -Xmn50m -XX:MaxGCPauseMillis=1000 -XX:GCTimeRatio=10 -XX:+DisableExplicitGC
CPU resource management for GSM/LUS - Specifically for Solaris
To make sure the GSM/LUS gets CPU resources and continues to serve clients even when other processes on the machine use a lot of CPU, you can use the Solaris pbind and priocntl commands.
pbind -b <CPUiD> <processID>
Example:
pbind -b 0 12312
Binding to a specific core could help when a single core is sufficient to handle the amount of incoming requests.
A more powerful option would be to use time slicing:
priocntl -s -c RT -t 500 -i pid <ProcessID>
The above means that 500 ms of each second (not less, if the process demands it) will be allocated to the process with the given ProcessID.
Another example:
priocntl -e -c RT -t 700 su - gspaces -c gsm.sh
The above commands must be run by the root user.
Additional CPU affinity options, such as binding to processor sets, are available on Solaris and can improve overall application performance and scalability.
GSC monitors tuning - \GigaSpacesXAP\config\services\services.config
To avoid the resource leaks that occur when forking processes (used to calculate the machine's free disk space), and to avoid wasting CPU cycles while the GSC monitors its machine resources to trigger SLA events, configure the following. If you are not using SLA-based events, you should disable these monitors or set a long reportRate.
com.gigaspaces.management.system.cpu {
reportRate = 3000000;
sampleSize = 2;
// Default CPU utilization high threshold watermark is 99%
thresholdValues = new org.jini.rio.core.ThresholdValues(0.0, 0.99);
}
com.gigaspaces.management.system.memory {
reportRate = 1500000;
sampleSize = 2;
// Default Memory utilization high threshold watermark is 99%
thresholdValues = new org.jini.rio.core.ThresholdValues(0.0, 0.99);
}
com.gigaspaces.management.system.disk {
reportRate = 6000000;
sampleSize = 2;
// Default Disk utilization high threshold watermark is 99%
thresholdValues = new org.jini.rio.core.ThresholdValues(0.0, 0.99);
enabled = false;
}
GSM provisioningPoolMaxThreads tuning - \GigaSpacesXAP\config\services\services.config
Increase provisioningPoolMaxThreads to allow more services (spaces) to be deployed at the same time:
org.jini.rio.monitor {
provisioningPoolMaxThreads = 32;
}
Replication mode Tuning
GigaSpaces 6.5 uses sync replication mode by default, which is optimized for fast concurrent activity.
sync-rec-ack replication mode is optimized for a small number of concurrent clients with a slow activity rate.
In that case you might want to use sync-rec-ack mode to speed up replication.
LRMI tuning
6.5 includes the ability to configure the number of NIO selector threads. This is relevant on multi-core machines.
com.gs.transport_protocol.lrmi.selector.threads=8
Client liveness
The following is relevant in cases with lots of clients or lots of spaces. It avoids the heartbeat and chatter between the clients and the spaces that is used to identify a failed space. Changing these settings may, in some cases, delay failover.
-Dcom.gs.cluster.livenessMonitorFrequency=Long.MAX_VALUE
-Dcom.gs.cluster.livenessDetectorFrequency=Long.MAX_VALUE
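The two options above write Long.MAX_VALUE symbolically. If these properties are parsed as plain numbers (an assumption, not confirmed here), the numeric value of Java's Long.MAX_VALUE would have to be passed on the command line instead. A small sketch that prints both options with that value filled in; the helper name liveness_opts is made up:

```shell
# Numeric value of Java's Long.MAX_VALUE (2^63 - 1).
LONG_MAX=9223372036854775807

# liveness_opts  (hypothetical helper)
# Prints the two -D options from the text with the numeric value filled in.
liveness_opts() {
    printf '%s %s\n' \
        "-Dcom.gs.cluster.livenessMonitorFrequency=$LONG_MAX" \
        "-Dcom.gs.cluster.livenessDetectorFrequency=$LONG_MAX"
}

liveness_opts
```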
Fail-Over and active election parameters
Tuning Jini TaskManager - com.sun.jini.thread.TaskManager(80, 15000, 2.0F, "Reggie Comm Task", 10)
-Dcom.gs.transport_protocol.lrmi.connect_timeout=3s
-Dcom.gs.transport_protocol.lrmi.request_timeout=3s
-Dcom.gs.jini.config.maxLeaseDuration=1000
-Dcom.gs.jini.config.roundTripTime=1000
-Dcom.gs.failover.standby-wait-time=1000
Cluster schema:
<prop key="cluster-config.groups.group.fail-over-policy.active-election.yield-time">1000</prop>
<prop key="cluster-config.groups.group.fail-over-policy.active-election.fault-detector.invocation-delay">1000</prop>
<prop key="cluster-config.groups.group.fail-over-policy.active-election.fault-detector.retry-count">1</prop>
<prop key="cluster-config.groups.group.fail-over-policy.active-election.fault-detector.retry-timeout">1000</prop>
Space lookup lease tuning - D:\GigaSpacesXAP6\config\services\advanced-space.config
net.jini.lease.LeaseRenewalManager {
taskManager = new com.sun.jini.thread.TaskManager(11, 15000, 1.0F, "Space LeaseRenewalManager Task", 10);
//Default value for net.jini.lease.LeaseRenewalManager.roundTripTime
roundTripTime=1000;
}
net.jini.lookup.JoinManager {
taskManager = new com.sun.jini.thread.TaskManager(15, 30000, 1.0F, "Space JoinManager Task", 10);
//Default value for net.jini.lookup.JoinManager.maxLeaseDuration
maxLeaseDuration=2000;
}
OS Networking Tuning
By default, out-of-the-box network settings are not tuned for applications that use network connections with mid-size or large packets.
Below is a list of parameters that should be tuned, with recommended values for replication over the WAN. For LAN-based environments different settings should be used.
The parameters should be set in /etc/sysctl.conf (root permissions required; default parameters set during OS installation should be preserved):
net.ipv4.tcp_no_metrics_save = 1
net.core.optmem_max=10000000
net.core.rmem_default=131072
net.core.rmem_max=1048576
net.core.wmem_default=131072
net.core.wmem_max=1048576
net.ipv4.tcp_max_tw_buckets=200000
net.ipv4.tcp_mem=131072 131072 10000000
net.ipv4.tcp_rmem=131072 131072 10000000
net.ipv4.tcp_wmem=131072 131072 10000000
net.ipv4.tcp_max_syn_backlog=50000
net.core.somaxconn=300000
net.core.netdev_max_backlog=300000
net.ipv4.tcp_reordering=20
net.ipv4.tcp_timestamps=1
net.ipv4.tcp_window_scaling=1
The machine should be rebooted to allow the settings to take effect.
To enable TCP windows larger than 64KB, tcp_window_scaling should be set to 1 on both TCP peers.
Sometimes routers in between decrease the TCP window size below 64KB when TCP window scaling is enabled, resulting in a performance decrease.
In such cases tcp_window_scaling should be set to 0, which keeps the TCP window size at 64KB.
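Whether a link needs windows larger than 64KB can be estimated from its bandwidth-delay product (bandwidth multiplied by round-trip time), which is the amount of data that must be in flight to keep the link full. The sketch below does that arithmetic; the helper names and the 100 Mbit/s, 80 ms example link are made up for illustration.

```shell
# bdp_bytes MBIT_PER_SEC RTT_MS  (hypothetical helper)
# Bandwidth-delay product in bytes: bits/s divided by 8, times RTT in seconds.
bdp_bytes() {
    echo $(( $1 * 1000000 / 8 * $2 / 1000 ))
}

# needs_window_scaling MBIT_PER_SEC RTT_MS  (hypothetical helper)
# Prints yes when the product exceeds the 65535-byte window that is
# available without window scaling.
needs_window_scaling() {
    if [ "$(bdp_bytes "$1" "$2")" -gt 65535 ]; then
        echo yes
    else
        echo no
    fi
}

# Example WAN link: 100 Mbit/s with an 80 ms round trip.
echo "BDP: $(bdp_bytes 100 80) bytes, scaling needed: $(needs_window_scaling 100 80)"
```

For this example link the product is 1,000,000 bytes, which fits within the rmem_max/wmem_max value of 1048576 listed above but is far beyond a 64KB window.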
To apply at runtime the updated kernel parameter values written in /etc/sysctl.conf:
/sbin/sysctl -p (a reboot is preferable; there are confirmed cases where the settings actually took effect only after a reboot)
To review kernel parameters:
/sbin/sysctl -a
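Since there are cases where the settings did not take effect until a reboot, it can be useful to compare the wanted values with what the kernel currently reports. The following is a minimal sketch under a few assumptions: the helper names are made up, the input is a list of key=value lines such as the one above, and the current values are read from /proc/sys (Linux only).

```shell
# normalize_sysctl  (hypothetical helper)
# Strips comments, blank lines and the spaces around '=' from stdin.
normalize_sysctl() {
    sed -e 's/#.*//' \
        -e 's/[[:space:]]*=[[:space:]]*/=/' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e '/^$/d'
}

# report_sysctl_drift  (hypothetical helper)
# Reads key=value lines on stdin and prints every key whose current value
# under /proc/sys differs from the wanted one.
report_sysctl_drift() {
    normalize_sysctl | while IFS='=' read -r key want; do
        path="/proc/sys/$(echo "$key" | tr . /)"
        [ -r "$path" ] || { echo "$key: not available"; continue; }
        # Multi-value entries are tab separated in /proc/sys.
        have=$(tr -s '[:space:]' ' ' < "$path" | sed 's/ $//')
        [ "$have" = "$want" ] || echo "$key: have '$have', want '$want'"
    done
}

# Check two of the values from the list above.
printf '%s\n' 'net.ipv4.tcp_window_scaling=1' 'net.core.rmem_max=1048576' |
    report_sysctl_drift
```

No output means the checked values are already in effect. To check a whole file, run report_sysctl_drift < /etc/sysctl.conf.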
Linux epoll
-Djava.nio.channels.spi.SelectorProvider=sun.nio.ch.EPollSelectorProvider
See:
http://java.sun.com/j2se/1.5.0/ReleaseNotes.html
Disable multicast activity
If you don't need it - turn it off.
Set -Dcom.gs.multicast.enabled=false on the server/space/LUS/GSM AND client sides.
Verify it is disabled by setting the logging levels com.sun.jini.reggie.level = CONFIG and net.jini.discovery.LookupDiscovery.level = CONFIG.
Some additional info regarding Solaris OS:
1) Solaris has a sophisticated resource management mechanism, including CPU time sharing with hard real-time support and the ability to establish affinity to processor sets. These mechanisms prevent CPU starvation by assuring critical processes of sufficient CPU under heavy load. Neither Linux nor Windows has such resource management; see
http://www.opensolaris.org/os/article/2005-10-14_a_comparison_of_solaris__linux__and_freebsd_kernels/. In addition, the ability to define affinity between groups of processes and specific processor sets makes it possible to increase CPU cache hit rates, which increases overall application throughput and scalability; see
http://developers.sun.com/solaris/articles/scalable/.
2) Time sharing up to the hard real-time level, with microsecond resolution, assures deterministic application response times. Combined with the Sun Real-Time JVM, it is possible to get sub-millisecond response time determinism.
3) Solaris stability is exceptional. A system can run for a year or more with no reboot. Under memory starvation RedHat Linux crashes, which never happens with Solaris.
4) Solaris's ability to run many processes/threads is exceptional. It can run more than one thousand processes effectively, which Windows can never do.
5) Starting with Solaris 9, the whole TCP stack was completely rewritten, and TCP performance and stability improved significantly. It introduced many clever defaults that reflect recent years of experience in TCP communications.
6) Sun has CoolThreads massive multi-threading support in its Niagara server line. The latest Niagara2 SMP servers show exceptional performance and scalability, with support for up to 128 simultaneous threads.
7) The open source community uses Solaris as a primary platform, meaning that almost all open source projects are run/tested on Solaris.
8) The ZFS filesystem is included in Solaris and it is free. ZFS is the only filesystem with atomic transaction support. This means that the data it stores are always instantaneous snapshots at a given point in time, which always preserves time consistency. In simple words: whereas any other filesystem could be corrupted if a crash occurs under heavy database load, this never happens with ZFS. At the same time, ZFS is optimized for remote data replication, providing a very significant performance boost when remote storage replication is in place.
Shay
Edited by: Shay Hassidim on Sep 2, 2008 11:28 AM