
Solaris™ 2.x - Tuning Your TCP/IP Stack and More


Last update: 10.04.2001 (change log)

Please check your location line carefully. If you don't see http://www.sean.de/Solaris/ in your location bar, you might want to check with the original site for the most up to date information.

Important Notice!

SUN managed to publish a Solaris Tunable Parameters Reference Manual, applying to Solaris 8, HW 01/01. You might want to check there for anything you miss here.

Table of contents

  1. Introduction
    1.1 History
    1.2 Quick intro into ndd
    1.3 How to read this document
  2. TCP connection initiation
  3. Retransmission related parameters
  4. Path MTU discovery
  5. Further advice, hints and remarks
    5.1 Common TCP timers
    5.2 Erratic IPX behaviors
    5.3 Common IP parameters
    5.4 TCP and UDP port related parameters
  6. Windows, buffers and watermarks
  7. Tuning your system
    7.1 Things to watch
    7.2 General entries in the file /etc/system
    7.3 System V IPC related entries
    7.4 How to find further entries
  8. 100 Mbit ethernet and related entries
    8.1 The hme interface
    8.2 Other problems
  9. Recommended patches
  10. Literature
    10.1 Books
    10.2 Internet resources
    10.3 RFC, mentioned and otherwise
    10.4 Further material
  11. Solaris' Future
    11.1 Solaris 7
    11.2 Solaris 8
    11.3 Solaris 9
  12. Uncovered material
  13. Scripts
  14. List of things to do

Appendices are separate documents. They are quoted from within the text, but you might be interested in them when downloading the current document. If you say "print" for this document, the appendices will not be printed. You have to download and print them separately.

  1. Simple transactions using TCP
  2. System V IPC parameter
  3. Retransmission behavior
  4. Slow start implications
  5. The change log
  6. Glossary (first attempt)
  7. Index (first attempt)

1. Introduction

Use at your own risk!

If your system behaves erratically after applying some tweaks, please don't blame me. Remember to have a backup handy before starting to tune. Always make backup copies of the files you are changing. I tried carefully to assemble the information you are seeing here, aimed at improved system performance. As usual, there are no guarantees that what worked for me will work for you. Please don't take my recommendations as gospel: they are starting points, not absolutes. Always read my reasoning, and don't apply the values blindly.

Before you start, you ought to grab a copy of the TCP state transition diagram as specified in RFC 793 on page 23. The drawback is that it lacks the corrections supplied by later RFCs. There is an easier way to obtain blowup printouts to staple to your office walls: grab a copy of the PostScript file pocket guide, page 2, accompanying Stevens' TCP/IP Illustrated Volume 1 [4]. Or simply open the book at figure 18.12.

Please share your knowledge

I try to assemble this page and related material for everybody interested in gaining more from her or his system. If you have an item I didn't cover, but which you deem worthwhile, please write to me. A few dozen or so regular readers of this page will thank you for it. I am only human, thus if you stumble over an error, misconception, or blatant nonsense, please have me correct it. In the past, there were quite a few mistakes.

The set of documents may look a trifle colorful, or just odd, if your browser supports cascading stylesheets. Care was taken to select the formatting tags in a way that the printed output still resembles the intentions of the author, and that the set of documents is still viewable with browsers like Mosaic or Lynx. Stylesheets were used as an optical enhancement. Most notable is the different color of interior and external links. Interior links are shown in greenish colors, and will be rendered within the same frame. External links on the other hand are shown in bluish colors, and all of them open in one new window. If you leave that window open, the next external link will be shown in it as well. Literature references within the text are often interior links, pointing to the literature section, where the external links are located.

1.1 History

This page and the related work have a long history of gathering. I started out peeking wide-eyed over the shoulders of two people from a search engine provider when they were installing the German server of a customer of my former employer. My only alternative resource of tuning information was the brilliant book TCP/IP Illustrated 1 [4] by Stevens. I started gathering all the information about tuning I was able to get my hands on. The accumulation of it is what you see on these pages.

1.2 Quick intro into ndd

Solaris allows you to tune, tweak, set and reset various parameters related to the TCP/IP stack while the system is running. Back in the SunOS 4.x days, one had to change various C files in the kernel source tree, generate a new kernel, reboot the machine and try out the changes. The Solaris feature of changing the important parameters on the fly is very convenient.

Many of the parameters I mention in the rest of the document you are reading are time intervals. All intervals are measured in milliseconds. Other parameters are usually byte counts, but in a few places different units of measurement are used and documented. A few items appear totally unrelated to TCP/IP, but due to the lack of a better framework, they materialized on this page.

Most tunings can be achieved using the program ndd. Any user may execute this program to read the current settings, depending on the readability of the respective device files. But only the super user is allowed to execute ndd -set to change values. This makes sense considering the sensitive parameters you are tuning. Details on the use of ndd can be obtained from the respective manual page.

ndd will become your friend, as it is the major tool to tweak most of the parameters described in this document. Therefore you had better make yourself familiar with it. A quick overview will be given in this section, too. ndd is not limited to tweaking TCP/IP related parameters. Many other devices which have a device file underneath /dev and a kernel module can be configured with the help of ndd. For instance, any networking driver which supports the Data Link Provider Interface (DLPI) can be configured.

The parameters supplied to ndd are symbolic keys indexing either a single, usually numerical, value or a table. Please note that the keys usually (but not always) start out with the module or device name. For instance, to change values of the IP driver, you have to use the device file /dev/ip, and all parameters start out with ip_. The question mark is the most notable exception to this rule.

1.2.1 Interactive mode

The interactive mode allows you to inspect and modify a device, driver or module interactively. In order to inspect the parameter names available for a module, just type the question mark. The next item explains the output format of the parameter list.

# ndd /dev/tcp
name to get/set ? tcp_slow_start_initial
value ? 
length ? 
2
name to get/set ? ^D

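The example above queries the TCP driver interactively for the value of the slow start feature. The original page shows the typed input in boldface: here it is the parameter name, an empty answer at the value and length prompts, and Control-D to leave.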

1.2.2 Show all available parameters

If you are interested in the parameters you can tweak for a given module, query for the question mark. This special parameter name is part of all ndd configurable material. It tells the names of all parameters available - including itself - and the access mode of the parameter.

# ndd /dev/icmp \?
?                             (read only)
icmp_wroff_extra              (read and write)
icmp_def_ttl                  (read and write)
icmp_bsd_compat               (read and write)
icmp_xmit_hiwat               (read and write)
icmp_xmit_lowat               (read and write)
icmp_recv_hiwat               (read and write)
icmp_max_buf                  (read and write)

Please note that you have to escape the question mark with a backslash to protect it from the shell if you are querying in the non-interactive fashion shown above.

1.2.3 Query the value of one or more parameters (read access)

At the command line, you often need to check on settings of your TCP/IP stack or other parameters. By supplying the parameter name, you can examine the current setting. It is permissible to mention several parameters to check on at once.

 # ndd /dev/udp udp_smallest_anon_port
 32768
 # ndd /dev/hme link_status link_speed link_mode
 1

 1
 
 1

The first example checks on the smallest anonymous port UDP may use when sending a PDU. Please refer to the appropriate section later in this document on the recommended settings for this parameter.

The second example checks the three important link report values of a 100 Mbit ethernet interface. The results are separated by an empty line, because some parameters may refer to tabular values instead of a single number.

1.2.4 Modify the value of one parameter (write access)

This mode of interaction with ndd will frequently be found in scripts, or when changing values at the command line in a non-interactive fashion. Please note that you may only set one value at a time. The scripts section below contains examples of how to make changes permanent using a startup script.

 # ndd -set /dev/ip ip_forwarding 0

The example will stop the forwarding of IP PDUs, even if more than one non-local interface is active and up. Of course, you can only change parameters which are marked for both reading and writing.
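Since ndd sets only one value per invocation, a script that changes several parameters has to loop. The following is a minimal sketch of such a loop, not the startup script from the scripts section; the function name ndd_apply and the DRY_RUN switch are made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: read "device parameter value" triples from stdin
# and issue one "ndd -set" per line, because ndd sets one value at a time.
# With DRY_RUN=1 the commands are only printed, not executed.
ndd_apply() {
    while read -r dev param value; do
        case $dev in
            ''|'#'*) continue ;;      # skip blank lines and comments
        esac
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "ndd -set $dev $param $value"
        else
            # </dev/null keeps ndd from eating the remaining input lines
            ndd -set "$dev" "$param" "$value" </dev/null ||
                echo "failed: $dev $param" >&2
        fi
    done
}
```

Feeding it the single line "/dev/ip ip_forwarding 0" with DRY_RUN=1 prints the command from the example above, which is a cheap way to review a list of changes before applying them as the super user.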

1.2.5 Further remarks

Andres Kroonmaa kindly supplied a nifty script to check all existing values for a network component (tcp, udp, ip, icmp, etc.). Usually I do the same thing using a small Perl script.
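Neither of those scripts is reproduced here. As a rough idea of what such a script does, the sketch below walks the question mark listing and queries every parameter marked readable. It is an illustration only, not Andres Kroonmaa's script; the function name ndd_dump and the NDD override variable are assumptions made for this example.

```shell
#!/bin/sh
# Sketch: print "name = value" for every readable parameter of a device.
# NDD is a hypothetical override hook so the loop can be tried with a stub.
NDD=${NDD:-ndd}

ndd_dump() {
    dev=$1
    # The "?" listing has one "name (access mode)" pair per line.
    "$NDD" "$dev" '?' | while read -r name mode; do
        [ "$name" = '?' ] && continue      # skip the listing entry itself
        case $mode in
            *read*) ;;                     # marked readable: query it
            *) continue ;;                 # anything else: skip
        esac
        # Tabular values span several lines; fold them into one line.
        value=$("$NDD" "$dev" "$name" </dev/null | tr '\n' ' ' | sed 's/ *$//')
        printf '%s = %s\n' "$name" "$value"
    done
}
```

A call like "ndd_dump /dev/icmp" would then print one line per parameter, which is handy for saving the state of a module before you start tuning.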

1.3 How to read this document

This document is separated into several chapters with little inter-relation. It is still advisable to loosely follow the order outlined in the table of contents.

The first chapter after this introduction focusses entirely on the TCP connection queues. It is quite long for such a small topic, but it is also meant to introduce you to my style of writing. The next chapter deals with TCP retransmission related parameters that you can adjust to your needs. That chapter is more concise. One chapter deals with path MTU discovery, as there used to be problems with older versions of Solaris. Recent versions usually do not need any adjustments.

The fifth chapter is a kind of catch-all. Some TCP, some UDP and some IP related parameters are explained (forwarding, port ranges, timers), and a quick detour into bug 1226653 explains that some versions were capable of sending packets larger than the MTU. The following chapter deals in depth with windows, buffers and related issues.

Chapter seven detours from the ndd interface, and focusses on variables you can set in your /etc/system file, as some things can only be managed that way. Another part of that chapter deals with the hme interface and appropriate tunables. The chapter may be split in the future, and parts of it are already found in the appendices.

The chapter dealing with patches, an important topic with any OS, just points you to various sources, and only mentions some essential things for older versions of Solaris.

Literature exists in abundance. The literature section is more of a loose collection of links and some books that I consider essential when working with TCP/IP, not limited to Solaris. The RFC section is kind of hard to keep up-to-date, but then, I reckon you know how to read the rfc-index file.

The final chapters quickly glance at new, or at one time new, versions of Solaris - time makes them obsolete. The chapter is there for historical reasons, more or less. The scripts section deals with the nettune script used by YaSSP. The document finishes with some TODO material.

2. TCP connection initiation

This section is dedicated exclusively to the various queues and tunable variable(s) used during connection instantiation. The socket API maintains some control over the queues. But in order to tune anything, you have to understand how listen and accept interact with the queues. For details, see the various Stevens books mentioned in the literature section.

When the server calls listen, the kernel moves the socket from the TCP state CLOSED into the state LISTEN, thus doing a passive open. All TCP servers work like this. Also, the kernel creates and initializes various data structures, among them the socket buffers and two queues:

incomplete connection queue

This queue contains an entry for every SYN that has arrived. BSD sources assign so_q0len entries to this queue. The server sends off the ACK of the client's SYN and the server side SYN. The connection gets queued and the kernel now awaits the completion of the TCP three way handshake to open a connection. The socket is in the SYN_RCVD state. The entry normally stays in this queue for one round trip time (RTT): on reception of the client's ACK to the server's SYN, the kernel moves the entry into the completed connection queue, described next.

completed connection queue

This queue contains an entry for each connection for which the three way handshake is completed. The socket is in the ESTABLISHED state. Each call to accept() removes the front entry of the queue. If there are no entries in the queue, the call to accept() usually blocks. BSD sources assign a length of so_qlen to this queue.

Both queues are limited regarding their number of entries. By calling listen(), the server is allowed to specify the size of the second queue for completed connections. If the server is for whatever reason unable to remove entries from the completed connection queue, the kernel is not supposed to queue any more connections. A timeout is associated with each received and queued SYN segment. If the server never receives an acknowledgment for a queued SYN segment (TCP state SYN_RCVD), the timer will expire and the connection is thrown away. The timeout is an important defense against SYN flood attacks.

Figure 1: Queues maintained for listening sockets (a model of the TCP listening queues).
Figure 2: TCP three way handshake, connection initiation (timing diagram).

Historically, the argument to the listen function specified the maximum number of entries for the sum of both queues. Many BSD derived implementations multiply the argument by a fudge factor of 3/2. Solaris <= 2.5.1 does not use the fudge factor, but adds 1, while Solaris 2.6 does use the fudge factor, though with a slightly different rounding mechanism than the one BSD uses. With a backlog argument of 14, Solaris 2.5.1 servers can queue 15 connections. Solaris 2.6 servers can queue 22 connections.
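The two figures for a backlog of 14 can be checked with a little shell arithmetic. This is only a sketch: the function names are invented, and the fudge factor formula (integer division, plus one) is an assumption that happens to reproduce the 22 quoted above. The exact rounding Solaris 2.6 uses is different and is not modelled here.

```shell
#!/bin/sh
# Queue capacity for a given listen() backlog, following the text.

# Solaris <= 2.5.1: no fudge factor, just add 1.
queue_limit_251() {
    echo $(( $1 + 1 ))
}

# Fudge factor of 3/2. ASSUMPTION: integer division plus one; this matches
# the 14 -> 22 example, but is not the exact Solaris 2.6 rounding.
queue_limit_fudge() {
    echo $(( 3 * $1 / 2 + 1 ))
}
```

Calling the functions with 14 yields 15 and 22 respectively. For other backlog values, treat the second function as an approximation.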

Stevens shows that the incomplete connection queue does need more entries for busy servers than the completed connection queue. The only reason for specifying a large backlog value is to enable the incomplete connection queue to grow as SYNs arrive from clients. Stevens shows that a moderately busy webserver has an empty completed connection queue 99 % of the time, but the incomplete connection queue needed 15 or fewer entries 98 % of the time! Just try to imagine what this would mean for a really busy webcache like Squid.

Data for an established connection which arrives before the connection is accept()ed should be stored into the socket buffer. If the queues are full when a SYN arrives, it is dropped in the hope that the client will resend it, hopefully finding room in the queues then.

According to Cockcroft [2], there was only one listen queue for unpatched Solari <= 2.5.1. Solari >= 2.6, or 2.5.1 with TCP patch 103582-12 or above applied, split the single queue into the two shown in figure 1. The system administrator is allowed to tweak and tune the various maxima of the queue or queues with Solaris. Depending on whether there are one or two queues, there are different sets of tweakable parameters.

The old semantics contained just one tunable parameter, tcp_conn_req_max, which specified the maximum argument for listen(). The patched versions and Solaris 2.6 replaced this parameter with the two new parameters tcp_conn_req_max_q0 and tcp_conn_req_max_q. A SunWorld article on 2.6 by Adrian Cockcroft tells the following about the new parameters:

tcp_conn_req_max [is] replaced. This value is well-known as it normally needs to be increased for Web servers in older releases of Solaris 2. It no longer exists in Solaris 2.6, and patch 103582-12 adds this feature to Solaris 2.5.1. The change is part of a fix that prevents denial of service from SYN flood attacks. There are now two separate queues of partially complete connections instead of one.

tcp_conn_req_max_q0 is the maximum number of connections with handshake incomplete. A SYN flood attack could only affect this queue, and a special algorithm makes sure that valid connections can still get through.

tcp_conn_req_max_q is the maximum number of completed connections waiting to return from an accept call as soon as the right process gets some CPU time.

In other words, the first parameter specifies the size of the incomplete connection queue, while the second parameter sets the maximum length of the completed connection queue. All three parameters are covered below.
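Which set of parameters your system offers can be read off the question mark listing of the TCP module. The sketch below only inspects such a listing on its standard input; the function name listen_queue_params and its output strings are made up for this illustration.

```shell
#!/bin/sh
# Reads an ndd "?" parameter listing on stdin and reports which set of
# listen queue parameters the TCP module offers.
listen_queue_params() {
    awk '
        $1 == "tcp_conn_req_max_q0" { q0 = 1 }
        $1 == "tcp_conn_req_max_q"  { q = 1 }
        $1 == "tcp_conn_req_max"    { old = 1 }
        END {
            if (q0 && q)  print "two queues: tcp_conn_req_max_q0 tcp_conn_req_max_q"
            else if (old) print "one queue: tcp_conn_req_max"
            else          print "unknown"
        }'
}
```

On a live system the input would come from the pipeline ndd /dev/tcp \? | listen_queue_params.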

You can determine whether you need to tweak this set of parameters by watching the output of netstat -sP tcp. Look for the value of tcpListenDrop, if available on your version of Solaris. Older versions don't have this counter. Any value showing up might indicate something wrong with your server, but then, killing a busy server (like squid) shuts down its listening socket, and might increase this counter (and others). If you get many drops, you might need to increase the appropriate parameter. Since connections can also be dropped because the argument to listen() is too small, you have to be careful interpreting the counter value. On old versions, a SYN flood attack might also increase this counter.

Newer or patched versions of Solaris, with both queues available, will also have the additional counters tcpListenDropQ0 and tcpHalfOpenDrop. Now the original counter tcpListenDrop counts only connections dropped from the completed connection queue, and the counter ending in Q0 counts the drops from the incomplete connection queue. Killing a busy server application might increase either or both counters. If tcpHalfOpenDrop shows non-zero values, your server was likely the victim of a SYN flood. The counter is only incremented for dropping noxious connection attempts. I have no idea whether those will also show up in the Q0 counter.
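To watch just these counters, the netstat output can be filtered. The sketch below assumes that the counters appear as "name = value" pairs, possibly several per line, and that a large value may be printed directly next to the equals sign; the function name listen_drops is invented for this example.

```shell
#!/bin/sh
# Reads "netstat -sP tcp" style output on stdin and prints the listen
# drop counters. Counters missing on older releases are simply not printed.
listen_drops() {
    awk '{
        gsub(/=/, " = ")        # separate values that touch the "=" sign
        for (i = 1; i + 2 <= NF; i++)
            if ($(i + 1) == "=" &&
                ($i == "tcpListenDrop" || $i == "tcpListenDropQ0" ||
                 $i == "tcpHalfOpenDrop"))
                print $i, $(i + 2)
    }'
}
```

On a live system you would run netstat -sP tcp | listen_drops, perhaps repeatedly, since it is the growth of a counter over time that tells you something, not its absolute value.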

tcp_conn_req_max
default 8 (max. 32), since 2.5 32 (max. 1024), recommended 128 <= x <= 1024
since 2.6 or 2.5.1 with patches 103630-09 and 103582-12 or above applied:
see tcp_conn_req_max_q and tcp_conn_req_max_q0

The current parameter describes the maximum number of pending connection requests queued for a listening endpoint in the completed connection queue. The queue can only save the specified finite number of requests. If a queue overflows, nothing is sent back. The client will time out and (hopefully) retransmit.

The size of the completed connection queue does not influence the maximum number of simultaneous established connections after they were accepted nor does it have any influence on the maximum number of clients a server can serve. With Solaris, the maximum number of file descriptors is the limiting factor for simultaneous connections, which just happened to coincide with the maximum backlog queue size.

From the viewpoint of TCP those connections placed in the completed connection queue are in the TCP state ESTABLISHED, even though the application has not reaped the connection with a call to accept. That is the number limited by the size of the queue, which you tune with this parameter. If the application, for some reason, does not release entries from the queue by calling accept, the queue might overflow, and the connection is dropped. The client's TCP will hopefully retransmit, and might find a place in the queue.

Solaris offers the possibility to place connections into the backlog queue as soon as the first SYN arrives, called eager listening. The three way handshake will be completed as soon as the application accept()s the connection. The use of eager listening is not recommended for production systems.

Solari < 2.5 have a maximum queue length of 32 pending connections. The length of the completed connection queue can also be used to decrease the load on an overloaded server: If the queue is completely filled, remote clients will be denied further connections. Sometimes this will lead to a connection timed out error message.

Naively, I assumed that a very large queue length might lead to a long service time on a loaded server. Stevens showed that the incomplete connection queue needs much more attention than the completed connection queue. But with tcp_conn_req_max you have no option to tweak that particular length.

Earlier versions of this document suggested tuning tcp_conn_req_max with regard to the values of rlim_fd_max and rlim_fd_cur, but the interdependencies are more complex than any rule of thumb. You have to find your own ideal. While a connection is still in the queue, only the queue length limits the number of entries. Each connection taken from the queue is assigned a file descriptor.

There is a trick to overcome the hardcoded limit of 1024 with a patch. SunSolve shows this trick in connection with SYN flood attacks. A greatly increased listen backlog queue may offer some small increased protection against this vulnerability. On this topic also look at the tcp_ip_abort_cinterval parameter. Better, use the mentioned TCP patches, and increase the q0 length.

echo "tcp_param_arr+14/W 0t10240" | adb -kw /dev/ksyms /dev/mem

This patch only affects the currently active kernel, so it lasts only until the next boot. Usually you want to append the line above to the startup script /etc/init.d/inetinit. The shown patch increases the hard limit of the listen backlog queue to 10240. Only after applying this patch may you use values above 1024 for the tcp_conn_req_max parameter.

A further warning: Changes to the value of the tcp_conn_req_max parameter in a running system will not take effect until each listening application is restarted. The backlog queue length is evaluated whenever an application calls listen(3N), usually once during startup. Sending a HUP signal may or may not work; personally I prefer to TERM the application and restart it manually or, even better, use a startup script.

tcp_conn_req_max_q0
since 2.5.1 with patches 103630-09 and 103582-12 or above applied: default 1024;
since 2.6: default 1024, recommended 1024 <= x <= 10240

After installing the mentioned TCP patches, alternatively after installing Solaris 2.6, the parameter tcp_conn_req_max is no longer available. In its stead the new parameters tcp_conn_req_max_q and tcp_conn_req_max_q0 emerged. tcp_conn_req_max_q0 is the maximum number of connections with handshake incomplete, basically the length of the incomplete connection queue.

In other words, the connections in this queue are just being instantiated. A SYN was just received from the client, thus the connection is in the TCP SYN_RCVD state. The connection cannot be accept()ed until the handshake is complete, even if the eager listening is active.

To protect against SYN flooding, you can increase this parameter. Also refer to the parameter tcp_conn_req_max_q below. I believe that changes won't take effect unless the applications are restarted.

tcp_conn_req_max_q
since 2.5.1 with patches 103630-09 and 103582-12 or above applied: default 128;
since 2.6: default 128, recommended 128 <= x <= tcp_conn_req_max_q0

After installing the mentioned TCP patches, alternatively after installing Solaris 2.6, the parameter tcp_conn_req_max is no longer available. In its stead the new parameters tcp_conn_req_max_q and tcp_conn_req_max_q0 emerged. tcp_conn_req_max_q is the length of the completed connection queue.

In other words, connections in this queue of length tcp_conn_req_max_q have completed the three way handshake of a TCP open. The connection is in the state ESTABLISHED. Connections in this queue have not been accept()ed by the server process (yet).

Also refer to the parameter tcp_conn_req_max_q0. Remember that changes won't take effect unless the applications are restarted.

tcp_conn_req_min
Since 2.6: default 1, recommended: don't touch

This parameter specifies the minimum number of available connections in the completed connection queue for select() or poll() to return "readable" for a listening (server) socket descriptor.

Programmers should note that Stevens [7] describes a timing problem, if the connection is RST between the select() or poll() call and the subsequent accept() call. If the listening socket is blocking, the default for sockets, it will block in accept() until a valid connection is received. While this seems no tragedy with a webserver or cache receiving several connection requests per second, the application is not free to do other things in the meantime, which might constitute a problem.

3. Retransmission related parameters

The retransmission timeout values used by Solaris are far too aggressive for wide area networks, although they can be considered appropriate for local area networks. SUN thus did not follow the suggestions of RFC 1122, quoted below. Newer releases of the Solaris kernel are correcting the values in question:

The recommended upper and lower bounds on the RTO are known to be inadequate on large internets. The lower bound SHOULD be measured in fractions of a second (to accommodate high speed LANs) and the upper bound should be 2*MSL, i.e., 240 seconds.

Besides the retransmit timeout (RTO) value, two further parameters R1 and R2 may be of interest. These don't seem to be tunable via any interface offered by Solaris that I know of.

The value of R1 SHOULD correspond to at least 3 retransmissions, at the current RTO. The value of R2 SHOULD correspond to at least 100 seconds.

[...]

However, the values of R1 and R2 may be different for SYN and data segments. In particular, R2 for a SYN segment MUST be set large enough to provide retransmission of the segment for at least 3 minutes. The application can close the connection (i.e., give up on the open attempt) sooner, of course.

A great many internet servers which are running Solaris retransmit segments unnecessarily often. The current condition of European networks indicates that a connection to the US may take up to 2 seconds. All parameters mentioned in the first part of this section relate to each other!

As a starter, take this little example. Consider a picture, size 1440 bytes, LZW compressed, which is to be transferred over a serial link at 14400 bps using an MTU of 1500. In the ideal case only one PDU gets transmitted. The ACK segment can only be sent after the complete PDU is received. The transmission takes about 1 second. These values seem low, but they are meant as 'food for thought'. Now consider something going awry...
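The "about 1 second" can be verified with a back-of-the-envelope calculation. The sketch below makes assumptions the text does not state: 40 bytes of TCP and IP headers without options, no link level compression, and either 8 bits per byte or, for an asynchronous line with start and stop bits, 10 bits per byte on the wire. The function name serial_tx_ms is hypothetical.

```shell
#!/bin/sh
# Time in milliseconds needed to clock one PDU onto a serial line.
# $1 payload bytes, $2 header bytes, $3 line bits per byte, $4 speed in bps
serial_tx_ms() {
    echo $(( ($1 + $2) * $3 * 1000 / $4 ))
}
```

With these assumptions, serial_tx_ms 1440 40 8 14400 gives 822 ms and serial_tx_ms 1440 40 10 14400 gives 1027 ms, so a retransmission timer well below one second would fire before the ACK for even a single undisturbed PDU could return.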

Solaris 2.5.1 behaves strangely if the initial SYN segment from the host doing the active open is lost. The initial SYN gets retransmitted only after a period of 4 * tcp_rexmit_interval_initial plus a constant C. The time is 12 seconds with the default settings. More information is being prepared on the retransmission test page.

The initial lost SYN may or may not be of importance in your environment. For instance, if you are connected via ATM SVCs, the initial PDU might initiate a logical connection (ATM works point to point) in less than 0.3 seconds, but will still be lost in the process. It is rather annoying for a user of 2.5.1 to wait 12 seconds until something happens.

tcp_rexmit_interval_initial
default 500, since 2.5.1 3000, recommended >= 2000 (500 for special purposes)

TCP waits this interval before it retransmits the last data sent when the acknowledgment is missing. Mind that this interval is used only for the first retransmission. The more international your server is, the larger you should choose this interval.

Special laboratory environments working in LAN-only environments might be better off with 500 ms or even less. If you are doing measurements involving TCP (which is almost always a bad idea), you should consider lowering this parameter.

Why do I consider TCP measurements a bad idea? If ad-hoc approaches are used, or there is no deeper knowledge of the mechanics of TCP, you are bound to arrive at wrong conclusions. Unless there are TCP dumps to document that indeed what you expect is actually happening, results may lead to wrong conclusions. If done properly, there is nothing wrong with TCP measurements. The same rules apply, if you are measuring protocols on top of TCP.

There are lots of knobs and dials to be fiddled with - all of which need to be documented along with the results. Scientific experiments need to be repeatable by others in order to verify your findings.

tcp_rexmit_interval_min
default 200, recommended >= 1000 (200 for special purposes)
Since 8: default 400

After the initial retransmission, further retransmissions will start after the tcp_rexmit_interval_min interval. BSD usually specifies 1500 milliseconds. This interval should be tuned relative to the value of tcp_rexmit_interval_initial, for instance to some value between 50 % and 200 % of it. The parameter has no effect on retransmissions during an active open; see my accompanying document on retransmissions.

The tcp_rexmit_interval_min doesn't display any influence on connection establishment with Solaris 2.5.1. It does with 2.6, though. The influence on regular data retransmissions, or FIN retransmissions I have yet to research.

tcp_ip_abort_interval
default 120000, since 2.5 480000, recommended 600000

This interval specifies how long retransmissions for a connection in the ESTABLISHED state should be tried before a RESET segment is sent. BSD systems default to 9 minutes.

tcp_ip_abort_cinterval
default 240000, since 2.5 180000, recommended ?

This interval specifies how long retransmissions for a remote host are repeated until the RESET segment is sent. The difference to the tcp_ip_abort_interval parameter is that this connection is about to be established - it has not yet reached the state ESTABLISHED. This value is interesting considering SYN flood attacks on your server. Proxy servers are doubly handicapped because of their Janus behavior (like a server towards the downstream cache, like a client towards the upstream server).

According to Stevens this interval is connected to the active open, e.g. the connect(3N) call. But according to SunSolve the interval has an impact in both directions. A remote client can refuse to acknowledge an opening connection up to this interval. After the interval a RESET is sent. The other way around works out, too. If the three-way handshake to open a connection is not finished within this interval, the RESET segment will be sent. This can only happen if the final ACK went astray, which is a difficult test case to simulate.

To improve your SYN flood resistance, SUN suggests using an interval as small as 10000 milliseconds. This value has only been tested for the "fast" networks of SUN. The more international your connection is, the slower it will be, and the more time you should grant in this interval. Proxy servers should never lower this value (and should let Squid terminate the connection). Webservers are usually not affected, as they seldom actively open connections beyond the LAN.

tcp_rexmit_interval_max
default 60000, RFC 1122 recommends 240000 (2MSL), recommended 1...2 * tcp_close_wait_interval or tcp_time_wait_interval
Since 2.6: default 240000
Since 8: default 60000

All previously mentioned retransmission related intervals use an exponential backoff algorithm. The wait interval between two consecutive retransmissions for the same PDU is doubled, starting with the minimum.

The tcp_rexmit_interval_max interval specifies the maximum wait interval between two retransmissions. If changing this value, you should also give the abort interval an inspection. The maximum wait interval should only be reached shortly before the abort interval timer expires. Additionally, you should coordinate your interval with the value of tcp_close_wait_interval or tcp_time_wait_interval.
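
To get a feeling for how the maximum wait interval and the abort interval interact, the backoff can be tabulated. The following is only a rough sketch: the function name backoff_schedule and the sample values are made up for illustration, and it assumes plain doubling without any round-trip time estimation.

```shell
#!/bin/sh
# Sketch only: models the wait between retransmissions as plain doubling,
# capped at a maximum, until the abort interval is used up. A real TCP also
# adapts the starting value to the measured round-trip time.
# Arguments (all in milliseconds): initial wait, maximum wait, abort interval.
# Output: one line per retransmission: number, wait before it, total elapsed.
backoff_schedule() {
    bo_wait=$1; bo_max=$2; bo_abort=$3
    bo_elapsed=0; bo_n=1
    if [ "$bo_wait" -le 0 ]; then
        echo "initial wait must be positive" >&2
        return 1
    fi
    while [ $((bo_elapsed + bo_wait)) -le "$bo_abort" ]; do
        bo_elapsed=$((bo_elapsed + bo_wait))
        echo "$bo_n $bo_wait $bo_elapsed"
        bo_n=$((bo_n + 1))
        # Double the wait, but never beyond the configured maximum.
        bo_wait=$((bo_wait * 2))
        if [ "$bo_wait" -gt "$bo_max" ]; then
            bo_wait=$bo_max
        fi
    done
}

# Hypothetical sample values: 3 s initial wait, 60 s cap, 8 min abort interval.
backoff_schedule 3000 60000 480000
```

With these sample values the cap applies from the sixth retransmission on, long before the abort interval expires. This is the kind of mismatch to look for when changing either value.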

tcp_deferred_ack_interval
default 50, BSD 200, recommended 200 (regular), 50 (benchmarking), or 500 (WAN server)
Since 8: default 100

This parameter specifies the timeout before sending a delayed ACK. The value should not be increased above 500, as required by RFC 1122. This value is of great interest for interactive services. A small number will increase the "responsiveness" of a remote service (telnet, X11), while a larger value can decrease the number of segments exchanged.

The parameter might also be of interest to HTTP servers which transmit small amounts of data after a very short retrieval time. With heavy-duty servers or in a laboratory banging environment, you might encounter service times answering a request which are well above 50 ms. An increase to 500 might lead to fewer PDUs transferred over the network, because TCP is able to merge the ACK with data. Increases beyond 500 should not even be considered.

SUN claims that Solaris recognizes the initial data phase of a connection. An initial ACK (not SYN) is not delayed. As opposed to the simplistic approach mentioned in the SUN paper, a request for a webservice (both server and proxy) which does not fit into a single PDU can be transmitted faster. Also check the tcp_slow_start_initial parameter.

The tcp_deferred_ack_interval also seems to be used to distinguish full-sized segments between interactive traffic and bulk data transfer. If a sender uses MSS sized segments, but sends each segment further apart than approximately 0.9 times the interval, the traffic will be rated interactive, and thus every segment seems to get ACKed.

tcp_deferred_acks_max
Since 2.6: default 8, recommended ?, maximum 16

This parameter sets the maximum number of segments that may be received before an ACK has to be sent. Previously I thought this parameter solely related to interactive data transfer, but I was mistaken. This parameter specifies the number of outstanding ACKs. You can give it a look when tuning for high speed traffic and bulk transfer, but the parameter is controversial. For instance, unless you employ selective acknowledgments (SACK) like Solaris 7, you can only ACK the number of segments correctly received. With the parameter at a larger value, statistically the amount of data to retransmit is larger.

Good values for retransmission tuning don't beam into existence from a white source. Rather you should carefully plan an experiment to get decent values. Intervals from another site can not be carried over to another Solaris system without change. But they might give you an idea where to start when choosing your own values.

The next part looks at a few parameters having to do with retransmissions, as well.

tcp_slow_start_initial
Since 2.5.1 with patch 103582-15 applied: default 1
Since 2.6: default 1, recommended 2 or 4 for servers
Since 8: default 4, no recommendations

This parameter provides the slow-start bug discovered in BSD and Windows TCP/IP implementations for Solaris. More information on the topic can be found on the servers of SUN and in Stevens [6]. To summarize the effect, a server starts sending two PDUs at once without waiting for an ACK due to wrong ACK counts: the ACK from connection initiation is counted as a data ACK - compare with figure 2. Network congestion avoidance algorithms are thereby undermined. The slow start algorithm does not allow the buggy behavior, compare with RFC 2001.

Setting the parameter to 2 allows a Solaris machine to behave like it has the slow start bug, too. Well, IETF is said to be amending the slow start algorithm, and the bug is now actively turned into a feature. SUN also warns:

It's still conceivable, although rare, that on a configuration that supports many clients on very slow-links, the change might induce more network congestions. Therefore the change of tcp_slow_start_initial should be made with caution.

[...]

Future Solaris releases are likely to default to 2.

You can also gain performance if many of your clients are running old BSD or derived TCP/IP stacks (like MS). I expect new BSD OS releases not to feature this bug, but then I am not familiar with the BSD OS family. A reader of this page told me about cutting the latency of his server in half, just by using the value of 2.

If you want to know more about this feature and its behavior, you can have a look at some experiments I have conducted concerning that particular feature. The summary is that I agree with the reader: A BSDish client like Windows definitely profits from using a value of 2.
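
As a back-of-the-envelope illustration of why the initial window matters for short transfers, one can count round trips under an idealized slow start. This is a sketch under strong assumptions: the function name slow_start_rounds is made up, the window simply doubles every round trip, and loss, delayed ACKs and the receiver window are ignored.

```shell
#!/bin/sh
# Idealized slow start: the congestion window starts at a given number of
# segments and doubles every round trip. Loss, delayed ACKs and the
# receiver's window are deliberately ignored.
# Arguments: number of segments to send, initial window in segments.
# Output: number of round trips needed to get all segments out.
slow_start_rounds() {
    ss_total=$1; ss_cwnd=$2
    ss_sent=0; ss_rounds=0
    if [ "$ss_cwnd" -le 0 ]; then
        echo "initial window must be positive" >&2
        return 1
    fi
    while [ "$ss_sent" -lt "$ss_total" ]; do
        ss_sent=$((ss_sent + ss_cwnd))
        ss_cwnd=$((ss_cwnd * 2))
        ss_rounds=$((ss_rounds + 1))
    done
    echo "$ss_rounds"
}

# A response of 10 segments, compared for the three values named in the text.
for iw in 1 2 4; do
    echo "tcp_slow_start_initial $iw: $(slow_start_rounds 10 "$iw") round trips"
done
```

The model says nothing about the interplay with delayed ACKs on the client, so treat the numbers as a lower bound on the difference, not as a prediction.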

tcp_slow_start_after_idle
Since 2.6: default 2, no recommendations
Since 8: default 4, no recommendations

I reckon that this parameter deals with the slow start for an already established connection which was idle for some time (however the term idle is defined here).

tcp_dupack_fast_retransmit
default 3, no recommendations

This parameter relates to the number of duplicate ACKs. If the fast retransmit and fast recovery algorithms are in use, this many duplicate ACKs must be received before we assume that a segment has really been lost. A simple reordering of segments usually causes no more than two duplicate ACKs.

There are a couple of parameters which require some elementary familiarity with RFC 2001, which covers TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms, as well as ssthresh and cwnd.

tcp_rtt_updates
default 0, BSD 16, recommended: (see text)
Since 8: 20, no recommendations

This parameter controls when things like rtt_sa (the smoothed RTT), rtt_sd (the smoothed mean deviation), and ssthresh (the slow start threshold) are cached in the routing table. By default, Solaris does not cache any of the parameters. It is claimed that you can set it to a value you like, but to be the same as BSD, use 16.

The value of this parameter is the number of RTT samples that have to be taken before an accurate enough value can be stored in the routing table. If you choose to use this feature, use a value of 16 or above. Using 16 allows the smoothed RTT filter to converge within 5 % of the correct value, compare Stevens [4], chapter 21.9.

ip_ire_cleanup_interval
default 30000, no recommendations
Since 8: the parameter has a new name, just which one?

The parameter may do more than described here. If a routing table entry is not directly connected and not being used, the cache for things like rtt_sa, rtt_sd and ssthresh associated with the entry will be flushed after 30 seconds. The parameter tcp_rtt_updates must be greater than zero to enable the cache.

I could imagine that external helper programs invoked by MRTG on a regular basis connecting to a far-away host might benefit from increasing this value slightly above the invocation interval.

4. Path MTU discovery

Whenever a connection is about to be established, during the three-way handshake, the segment size used will be set to the minimum of (a) the smallest MTU of an outgoing interface, and (b) the MSS announced by the peer. If the remote peer does not announce an MSS, usually the value 536 will be assumed. If path MTU discovery is active, all outgoing PDUs have the DF (don't fragment) flag set in the IP header.

If the ICMP error message fragmentation needed is received, a router on the way to the destination needed to fragment the PDU, but was not allowed to do so. Therefore the router discarded the PDU and sent back the ICMP error. Newer router implementations enclose the MTU of the next hop in the error message, from which the needed MSS follows. If it is not included, the correct MSS must be determined by a trial-and-error algorithm.

Due to the internet being a packet switching network, the route a PDU travels along a TCP virtual circuit may change with time. For this reason RFC 1191 recommends rediscovering the path MTU of an active connection after 10 minutes. Improvements of the route can only be noticed by repeated rediscoveries. Unfortunately, Solaris aggressively tries to rediscover the path MTU every 30 seconds. While this is o.k. for LAN environments, it is grossly impolite behavior in WANs. Since routes may not change that often, aggressive repetitions of path MTU discovery lead to unnecessary consumption of channel capacity and elongated service times.

Path MTU discovery is a far reaching and controversial topic when discussing it with local ISPs. Still, pMTU discovery is at the foundation of IPv6. The PSC tuning page argues pro path MTU discovery, especially if you maintain a high-speed or long-delay (e.g. satellite) link.

The recommendation I can give you is not to use the defaults of Solaris < 2.5. Please use path MTU discovery, but tune your system to conform to the RFC. You may alternatively want to switch off path MTU discovery altogether, though there are few situations where this is necessary.

I was made aware of the fact that in certain circumstances bridges connecting data link layers of differing MTU sizes defeat pMTU discovery. I have to put some more investigation into this matter. If a frame with maximum MTU size is to be transported into the network with the smaller MTU size, it is truncated silently. A bridge does not know anything about the upper protocol levels: A bridge neither fragments IP nor sends an ICMP error.

There may be work-arounds, and the tcp_mss_def is one of them. Setting all interfaces to the minimum shared MTU might help, at the cost of losing performance on the larger MTU network. Using what RFC 1122 calls an IP gateway is a possible, yet expensive solution.

ip_ire_pathmtu_interval
default 30000, recommended 600000
Since 2.5 600000, no recommendations

This timer determines the interval at which Solaris rediscovers the path MTU. An extremely large value will only evaluate the path MTU once at connection establishment.

ip_path_mtu_discovery
default 1, recommended 1

This parameter switches path MTU discovery on or off. If you enter a 0 here, Solaris will never try to set the DF bit in the IP header - unless your application explicitly requests it.

tcp_ignore_path_mtu
default 0, recommended 0

This is a debug switch! When activated, this switch will have the IP or TCP layer ignore all ICMP error messages fragmentation needed. By this, you will achieve the opposite of what you intended.

tcp_mss_def
default 536, recommended >= 536
Since 8: split into tcp_mss_def_ipv4 and tcp_mss_def_ipv6

This parameter determines the default MSS (maximum segment size) for non-local destinations. For path MTU discovery to work effectively, this value can be set to the MTU of the most-used outgoing interface decreased by 20 bytes of IP header and 20 bytes of TCP header - if and only if the value is bigger than 536.
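
The rule can be written down as a tiny helper. This is a minimal sketch; the function name mss_for_mtu is made up, and it assumes IP and TCP headers without options (20 bytes each), as in the text.

```shell
#!/bin/sh
# Derive a candidate tcp_mss_def from an interface MTU: subtract 20 bytes of
# IP header and 20 bytes of TCP header (no options assumed), and never go
# below the default of 536.
mss_for_mtu() {
    mss=$(( $1 - 20 - 20 ))
    if [ "$mss" -gt 536 ]; then
        echo "$mss"
    else
        echo 536
    fi
}

echo "ethernet, MTU 1500:   $(mss_for_mtu 1500)"
echo "IEEE 802.3, MTU 1492: $(mss_for_mtu 1492)"
echo "minimal, MTU 576:     $(mss_for_mtu 576)"
```

The result would then be handed to ndd in the usual way, under the name tcp_mss_def, or tcp_mss_def_ipv4 on Solaris 8.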

tcp_mss_def_ipv4
Since 8: default 536
tcp_mss_def_ipv6
Since 8: default 1460

Solaris 8 supports IPv6. Since IPv6 uses different defaults for the maximum segment size, one has to distinguish between IPv4 and IPv6. The default for IPv6 is close to what is said for tcp_mss_def.

5. Further advice, hints and remarks

This section covers a variety of topics, starting with various TCP timers which do not relate to previously mentioned issues. The next subsection throws a quick glance at some erratic behavior. The final section looks at a variety of parameters which deal with the reservation of resources.

Additionally, I strongly suggest the use of a file /etc/init.d/nettune (always called first script) which changes the tunable parameters. /etc/rcS.d/S31nettune is a hardlink to this file. The script will be executed during bootup when the system is in single user mode. A killscript is not necessary. The section about startup scripts below reiterates this topic in greater depth.
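
To give an idea of what the body of such a script might look like, here is a dry-run sketch that only prints the ndd commands instead of executing them, so that they can be reviewed first. The helper names ndd_device and nettune_cmd are made up, the mapping from parameter prefix to device is an assumption of this sketch, and the three values are merely the recommendations given for these parameters elsewhere in this document.

```shell
#!/bin/sh
# Dry-run sketch of a nettune script body: prints ndd commands, runs nothing.

# Pick the ndd device from the parameter prefix (assumed mapping).
ndd_device() {
    case $1 in
        tcp_*) echo /dev/tcp ;;
        udp_*) echo /dev/udp ;;
        ip_*)  echo /dev/ip ;;
        *)     return 1 ;;
    esac
}

# Print the ndd command for one parameter and value.
nettune_cmd() {
    nt_dev=$(ndd_device "$1") || {
        echo "nettune: no known device for $1" >&2
        return 1
    }
    echo "ndd -set $nt_dev $1 $2"
}

nettune_cmd ip_ire_pathmtu_interval 600000
nettune_cmd tcp_time_wait_interval 60000
nettune_cmd ip_forward_src_routed 0
```

On Solaris releases before 7 the second line would have to use the old name tcp_close_wait_interval. Once the printed commands look right, the echo in nettune_cmd can be replaced by the actual call.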

5.1 Common TCP timers

The current subsection covers three important TCP timers. First I will have a look at the keepalive timer. The timer is rather controversial, and some Solari implement it incorrectly. The next parameter limits the twice maximum segment lifetime (2MSL) value, which is connected to the time a socket spends in the TCP state TIME_WAIT. The final entry looks at the time spent in the TCP state FIN_WAIT_2.

tcp_keepalive_interval
default 7200000, minimum 10000, recommended 10000 <= x <= oo

This value is one of the most controversial ones when talking with other people about appropriate values. The interval specified with this key must expire before a keep-alive probe can be sent. Keep-alive probes are described in the host requirements RFC 1122: If a host chooses to implement keep-alive probes, it must enable the application to switch them on or off for a connection, and keep-alive probes must be switched off by default.

Keep-alives can terminate a perfectly good connection (as far as TCP/IP is concerned), cost you money and use up transmission capacity (commonly called bandwidth, which is, actually, something completely different). Determining whether a peer is alive should be a task of the application and thus kept on the application layer. Only if you run into the danger of keeping a server in the ESTABLISHED state forever, and thus using up precious server resources, should you switch on keep-alive probes.

Example for a webserver response

Figure 3: A typical handshake during a transaction.

Figure 3 shows the typical handshake during a HTTP connection. It is of no importance for the argumentation if the server is threaded, preforked or just plain forked. Webservers work transaction oriented as is shown in the following simplified description - the numbers do not relate to the figure:

  1. The client (browser) initiates a connection (active open).
  2. The client forwards its query (request).
  3. The server (daemon) answers (response).
  4. The server terminates the connection (active close).

Common implementations need to exchange 9..10 TCP segments per HTTP connection. The keep-alive option, as an extension to the HTTP/1.0 protocol, can be regarded as a hack. Persistent connections are a different matter, and not shown here. Most people still use HTTP/1.0, especially the Squid users.

The keep-alive timer becomes significant for webservers if in step 1 the client crashes or terminates without the server knowing about it. This condition can sometimes be forced by quickly pressing the stop button of Netscape or the logo of Mosaic. Thus the keep-alive probes do make sense for webservers. HTTP proxies look like a server to the browser, but look like a client to the server they are querying. Due to their server-like interface, the conditions for webservers are true for proxies, as well.

With an implementation of keep-alive probes working correctly, a very small value can make sense when trying to improve webservers. In this case you have to make sure that the probes stop after a finite time, if a peer does not answer. Solari <= 2.5 have a bug and send keep-alive probes forever. They seem to want to elicit some response, like a RST or some ICMP error message from an intermediate router, but never counted on the destination simply being down. Is this fixed with 2.5.1? Is there a patch available against this misbehavior? I don't know, maybe you can help me.

I am quite sure that this bug is fixed in 2.6 and that it is safe to use a small value like ten minutes. Squid users should synchronize their cache configuration accordingly. There are some Squid timeouts dealing with an idle connection.

tcp_close_wait_interval
default 240000 (according to RFC 1122, 2MSL), recommended 60000, possibly lower
Since 7: obsoleted parameter, use tcp_time_wait_interval instead
Since 8: no more access, use tcp_time_wait_interval

Even though the parameter key contains "close_wait" in its name, the value specifies the TIME_WAIT interval! In order to fix this kind of confusion, starting with Solaris 7, the parameter tcp_close_wait_interval was renamed to the correct name tcp_time_wait_interval. The old key tcp_close_wait_interval still exists for backward compatibility reasons. Users of Solari below 7 must use the old name tcp_close_wait_interval. Still, refer to tcp_time_wait_interval for an in-depth explanation.

tcp_time_wait_interval
Since 7: default 240000 (2MSL according to RFC 1122), recommended 60000, possibly lower

As Stevens repeatedly states in his books, the TIME_WAIT state is your friend. You should not desperately try to avoid it, rather try to understand it. The maximum segment lifetime (MSL) is the maximum interval a TCP segment may live in the net. Thus waiting twice this interval ensures that there are no leftover segments coming to haunt you. This is what the 2MSL is about. Afterwards it is safe to reuse the socket resource.

The parameter specifies the 2MSL according to the four minute limit specified in RFC 1122. With the knowledge about current network topologies and the strategies to reserve ephemeral ports you should consider a shorter interval. The shorter the interval, the faster precious resources like ephemeral ports are available again.

A toplevel search engine implementor recommends a value of 1000 milliseconds to its customers. Personally I believe this is too low for a regular server. A loaded search engine is a different matter altogether, but now you see where some people start tweaking their systems. I rather tend to use a multiple of the tcp_rexmit_interval_initial interval. The current value of tcp_rexmit_interval_max should also be considered in this case - even though retransmissions are unconnected to the 2MSL time. A good starting point might be the double RTT to a very remote system (e.g. Australia for European sites). Alternatively a German commercial provider of my acquaintance uses 30000, the smallest interval recommended by BSD.

tcp_fin_wait_2_flush_interval
BSD 675000, default 675000, recommended 67500 (one zero less)

This value seems to describe the (BSD) timer interval which prevents a connection from staying in the FIN_WAIT_2 state forever. FIN_WAIT_2 is reached if a connection closes actively. The FIN is acknowledged, but the FIN from the passive side didn't arrive yet - and maybe never will.

Usually webservers and proxies actively close connections - as long as you don't use persistent connections, and even those are closed from time to time. Apart from that, HTTP/1.0 compliant servers and proxies close connections after each transaction. A crashed or misbehaving browser may cause a server to use up a precious resource for a long time.

You should consider decreasing this interval, if netstat -f inet shows many connections in the state FIN_WAIT_2. The timer is only used, if the connection is really idle. Mind that after a TCP half close a simplex data transmission is still available towards the actively closing end. TCP half closes are not yet supported by Squid, though many web servers do support them (certain HTTP drafts suggest an independent use of TCP connections). Nevertheless, as long as the client sends data after the server actively half closed an established connection the timer is not active.
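
Counting by eye gets tedious on a busy server. The following helper is a small sketch for tallying connections in one state. The name count_state is made up, and it assumes that the TCP state is the last whitespace-separated field of each netstat output line, which you should verify on your release.

```shell
#!/bin/sh
# Count connections in a given TCP state. Reads netstat output on stdin and
# assumes the state is the last field of each connection line. The awk
# variable is passed as an operand so that old awk versions can cope, too.
count_state() {
    awk '$NF == state { n++ } END { print n + 0 }' state="$1"
}
```

Typical use would be netstat -f inet | count_state FIN_WAIT_2, repeated over a few minutes to see whether the number keeps growing.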

Sometimes, a Squid running on Solaris (2.5.1) confuses the system utterly. A great number of connections are, to a varying degree, in CLOSE_WAIT for reasons beyond me. During this phase the proxy is virtually unreachable for HTTP requests though, obnoxiously, it still answers ICP requests. Although lowering the value for tcp_close_wait_interval is only fixing symptoms indirectly, not the cause, it may help overcoming those periods of erratic behavior faster than the default. The thing needed would be some means to influence the CLOSE_WAIT interval directly.

5.2 Erratic IPX behavior

I noticed that Solari < 2.6 behave erratically under some conditions, if the IPX ethernet MTU of 1500 is used. Maybe there is an error in the frame assembly algorithm. If you limit yourself to the IEEE 802.3 MTU of 1492 byte, the problem does not seem to appear. A sample startup script with link in /etc/rc2.d can be used to change the MTU of ethernet interfaces after their initialization. Remember to set the MTU for every virtual interface, too!

Note, with a patched Solaris 2.5.1 or Solaris 2.6, the problem does not seem to appear. Limiting your MTU to non-standard might introduce problems with truncated PDUs in certain (admittedly very special) environments. Thus you may want to refrain from using the above mentioned script (always called second script in this document).

Since I observed the erratic behavior only in a Solaris 2.5, I believe it has been fixed with patch 103169-10, or above. The error description reads "1226653 IP can send packets larger than MTU size to the driver."

5.3 Common IP parameters

The following parameters have little impact on performance, nevertheless I reckon them worth noting here. Please note that parameters starting with the ip6 prefix apply to IPv6 while their twins with the ip prefix apply to IPv4:

ip6_forwarding
Since 8: default 1, recommended 0 for pure server hosts or security
ip_forwarding
default 2, recommended 0 for pure server hosts or security
Since 8: default 1, recommended 0 for security reasons

If you intend to disable the routing abilities of your host altogether, because you know you don't need them, you can set this switch to 0. The default value of 2 was only available in older versions of Solaris. It activates IP forwarding, if two or more real interfaces are up. The value of 1 in Solari < 8 activates IP forwarding regardless of the number of interfaces. With the possible exception of MBone routers and firewalling, you should leave routing to the dedicated routing hardware.

Starting with Solaris 8, the parameter set is split. You use ip_forwarding and ip6_forwarding to overall switch on forwarding of IPv4 and IPv6 PDUs respectively between interfaces. The interfaces participating in forwarding can be activated separately, see if:ip_forwarding. Unless your host is acting as router, it is still recommended for security reasons to switch off any forwarding between interfaces.

if:ip_forwarding
Since 8: default 0, maximum 1, recommended 0

Please replace the if part of the parameter name with the appropriate interface available on your system, e.g. hme0. Look into the available /dev/ip parameters, if unsure what interfaces are known to the IP stack.

Starting with Solaris 8, a subset of interfaces participating in IP forwarding can be selected by setting the appropriate parameter to 1. You also need to set the ip6_forwarding and ip_forwarding parameter, if you want to forward IPv6 or IPv4 respectively.

For security reasons, and in many environments, forwarding is not recommended.

ip6_forward_src_routed
Since 8: default 1, recommended 0 for security reasons
ip_forward_src_routed
default 1, recommended 0 for security reasons

This parameter determines whether IP datagrams which have the source routing option activated can be forwarded. The parameter has little meaning for performance but is rather of security relevance. Solaris may forward such datagrams, if the host route option is activated, bypassing certain security constructs - possibly undermining your firewall. Thus you should always disable it, unless the host functions as a regular router (and offers no other services).

If you enabled IPv6 forwarding or IPv4 forwarding, the *_forward_src_routed parameters may relate to forwarding.

ip_forward_directed_broadcasts
default 1, recommended 0 for pure server hosts or security

This switch decides whether datagrams directed to any of your direct broadcast addresses can be forwarded as link-layer broadcasts. If the switch is on (default), such datagrams are forwarded. If set to zero, pings or other broadcasts to the broadcast address(es) of your installed interface(s) are silently discarded. Switching it off is recommended for any host, but can break "expected" behavior.

ip6_respond_to_echo_multicast
Since 8: default 1, recommended 0 for security reasons
ip_respond_to_echo_broadcast
default 1, recommended 0 for security reasons

If you don't want to respond to an ICMP echo request (usually generated by the ping program) to any of your IPv4 broadcast or IPv6 multicast addresses, set the matching parameter to 0. On one hand, responding to broadcast pings is rumored to have caused panics, or at least partial network meltdowns. On the other hand, it is a valid behavior, and often used to determine the number of alive hosts on a particular network. If you are dead sure that neither you nor your network admin will need this feature, you can switch it off by using the value of 0.

If you do not want to respond to any IPv4-broadcast or IPv6-multicast probes for security reasons, it is recommended to set the matching parameter to 0.

ip_icmp_err_burst
Since 8: default 10, min 1, maximum 99999, see text

ip_icmp_err_interval
default 500, recommended: see text

Solaris IP only generates ip_icmp_err_burst ICMP error messages in any ip_icmp_err_interval, regardless of IPv4 or IPv6. In order to protect from denial of service (DOS) attacks, the parameters do not need to be changed. Some administrators may need a higher error generation rate, and thus may want to decrease the interval or increase the number of generated messages.

In versions of Solaris prior to 8, ip_icmp_err_interval used to define the minimum time between two consecutive ICMP error responses - as if in older versions the (by then not existing) ip_icmp_err_burst parameter had a value of 1. The generated ICMP responses include the time exceeded message as evoked by the traceroute command. If your current setting here is above the RTT of a traceroute probe, usually the second probe you see will time out.

If you set ip_icmp_err_burst to exactly 0, traceroute will not give away your host as running Solaris. Also, you have then switched off the rate limitation of ICMP messages, and are thus open to DOS attacks. Of course, there are other ways to determine which TCP/IP implementation a networked host is running.
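
The two limits translate into a ceiling on ICMP errors per second, which makes it easier to compare settings. This is a minimal sketch with a made-up function name icmp_err_rate, and it rounds down to whole messages.

```shell
#!/bin/sh
# Ceiling on generated ICMP error messages per second, given the burst size
# and the interval in milliseconds. Integer arithmetic, rounded down.
icmp_err_rate() {
    ir_burst=$1; ir_interval=$2
    if [ "$ir_interval" -le 0 ]; then
        echo "interval must be positive" >&2
        return 1
    fi
    echo $(( ir_burst * 1000 / ir_interval ))
}

echo "Solaris 8 defaults (burst 10, 500 ms): $(icmp_err_rate 10 500) per second"
echo "older behavior (as if burst 1, 500 ms): $(icmp_err_rate 1 500) per second"
```

The second line reflects why a quick succession of traceroute probes sees timeouts on older releases: only one error is generated per interval.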

ip6_icmp_return_data_bytes
Since 8: default 64, minimum 8, maximum 65520, no recommendations

ip_icmp_return_data_bytes
default 64, minimum 8, maximum 65520, no recommendations

The parameters control the number of bytes returned by any ICMP error message generated on this Solaris host. The default value 64 is sufficient for most cases. Some laboratory environments may want to temporarily increase the value in order to figure out problems with some network services.

ip6_send_redirects
Since 8: default 1, recommendation 0 for security reasons
ip_send_redirects
default 1, recommendation 0 for security reasons

These parameters control whether the IPv4 or IPv6 part of the IP stack sends ICMP redirect messages. For security reasons, it is recommended to disable sending out such messages, unless your host is acting as router.

If you enabled IPv6 forwarding or IPv4 forwarding, the *_send_redirects parameters may relate to forwarding.

ip6_ignore_redirects
Since 8: default 0, recommendation 1 for security reasons
ip_ignore_redirect
default 0, recommendation 1 for security reasons

These flags control whether your routing table can be updated by ICMP redirect messages. Unless your host acts as a router, it is recommended to ignore redirects for security reasons. Otherwise, malicious external hosts may confuse your routing table.

If you enabled IPv6 forwarding or IPv4 forwarding, the *_ignore_redirects parameters may relate to forwarding.

ip_addrs_per_if
default 256, minimum 1, maximum 8192, no recommendations

This parameter limits the number of virtual interfaces you can declare per physical interface. Especially if you run Web Polygraph, you will need to increase the number of virtual interfaces available on your system.

ip6_strict_dst_multihoming
Since 8: default 0, recommended: see text
ip_strict_dst_multihoming
default 0, recommended: see text

According to RFC 1122, a host is said to be multihomed, if it has more than one IP address. Each IP address is assumed to be a logical interface. Different logical interfaces may map to the same physical interface. Physical interfaces may be connected to the same or different networks.

The strong end system model aka strict multihoming requires a host not to accept datagrams on physical interfaces to which the logical one is not bound. Outgoing datagrams are restricted to the interface which corresponds with the source IP address.

The weak end system model aka loose multihoming lets a host accept any of its IP addresses on any of its interfaces. Outgoing datagrams may be sent on any interface.

For security reasons, it is recommended to require strict multihoming, that is, setting the parameter to value 1. In certain circumstances, though, it may be necessary to disable strict multihoming, e.g. if the host is connected to a virtual private network (VPN) or sometimes when acting as firewall.

For instance, I once maintained a setup, where a pair of related caching proxies were talking exclusively to each other via a crossover cable on one interface using private addresses while the other interface was connected to the public internet. In order to have them actually use the behind-the-scenes link, I had to manually set routes and disable strict multihoming.

5.4 TCP and UDP port related parameters

There are some parameters related to the ranges of ports associated with reserved access and non-privileged access. This section deals with the majority of useful parameters when selecting different than default port ranges.

udp_smallest_anon_port
tcp_smallest_anon_port
default 32768, recommended 8192

This value has the same size for UDP and TCP. Solaris allocates ephemeral ports above 32768. Busy servers or hosts using a large 2MSL, see tcp_close_wait_interval, may want to lower this limit to 8192. This yields more precious resources, especially for proxy servers.

A contra-indication may be servers and services running on well known ports above 8192. This parameter should be set very early during system bootup, especially before the portmapper is started.

The IANA port numbers document requires the dynamic and/or private ports to start at 49152. For busy servers, severely limiting their ephemeral port supply in such a manner is not an option.
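
How the port range and the 2MSL play together can be estimated with a little arithmetic. The sketch below uses a made-up function name max_conn_rate and rests on a simplifying assumption: the host actively opens and actively closes all connections to a single remote address and port, so that every connection blocks one local port for the full TIME_WAIT interval.

```shell
#!/bin/sh
# Rough ceiling on new outgoing connections per second towards ONE remote
# address and port, assuming each closed connection keeps its local port in
# TIME_WAIT for the whole interval.
# Arguments: smallest anon port, largest anon port, TIME_WAIT interval in ms.
max_conn_rate() {
    pr_ports=$(( $2 - $1 + 1 ))
    if [ "$3" -le 0 ]; then
        echo "interval must be positive" >&2
        return 1
    fi
    echo $(( pr_ports * 1000 / $3 ))
}

echo "defaults (32768..65535, 240 s): $(max_conn_rate 32768 65535 240000) per second"
echo "tuned    (8192..65535, 60 s):   $(max_conn_rate 8192 65535 60000) per second"
```

Real traffic is spread over many peers, so the actual limit is usually higher. The figures mainly show that both knobs pull in the same direction.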

udp_largest_anon_port
default 65535, recommended: see text

This parameter has to be seen in combination with udp_smallest_anon_port. The traceroute program tries to reach a random UDP port above 32768 - or rather tries not to reach such a port - in order to provoke an ICMP error message from the host.

Paranoid system administrators may want to lower the value for this reason down to 32767, after the corresponding value for udp_smallest_anon_port has been lowered. On the other hand, datagram application protocols should be able to cope with foreign protocol datagrams.

If an ICP caching proxy or other UDP hyper-active applications are used, the lowering of this value can not be recommended. The respective TCP parameter tcp_largest_anon_port does not suffer this problem.

tcp_largest_anon_port
default 65535, no recommendations

The largest anonymous port for TCP should be the largest possible port number. There is no need to change this parameter.

udp_smallest_nonpriv_port
default 1024, no recommendations
tcp_smallest_nonpriv_port
default 1024, no recommendations

Privileged ports can only be bound to by the superuser. The smallest non-privileged port is the first port that a regular user can have his or her application to bind to.

tcp_extra_priv_ports_add
udp_extra_priv_ports_add
write-only action
tcp_extra_priv_ports_del
udp_extra_priv_ports_del
write-only action
tcp_extra_priv_ports
udp_extra_priv_ports
default (depends on active services)

The extra privileged ports are those privileged ports outside the scope of the reserved ports. Reserved port numbers are usually below 1024, see tcp_smallest_nonpriv_port for TCP and udp_smallest_nonpriv_port for UDP, and require superuser privileges in order to bind to. For instance, if NFS is activated, the NFS server port 2049 is marked as privileged.

You can examine the extra privileged TCP ports by looking at the read-only parameter tcp_extra_priv_ports. If you need to add an extra privileged port, use the tcp_extra_priv_ports_add action with the port number as argument. If you need to remove an extra privileged port, use the tcp_extra_priv_ports_del action with the port number to remove as argument. You can only add or remove one port at a time.

# ndd /dev/tcp tcp_extra_priv_ports
2049 
4045 
# ndd -set /dev/tcp tcp_extra_priv_ports_add 4444 5555
# ndd /dev/tcp tcp_extra_priv_ports
2049 
4045 
4444 
# ndd -set /dev/tcp tcp_extra_priv_ports_del 4444
# ndd /dev/tcp tcp_extra_priv_ports
2049 
4045 

Analogous procedures apply to UDP extra privileged ports.

6. Windows, buffers and watermarks

This section is about windows, buffers and watermarks. It is still work in progress. The explanations available to me were very confusing (sigh), though the Stevens [7] helped to clear up a few things. If you have corrections to this section, please let me know and contribute to an update of the page. Many readers will thank you!

buffers and fragmentation while descending protocol layers

Figure 4: buffers and related issues

Here is just a short trip through the network layers in order to explain what happens where. Your application is able to send almost any size of data to the transport layer. The transport layer is either UDP or TCP. The socket buffers are implemented on the transport layer. Depending on your choice of transport protocol, different actions are taken on this level.

TCP
All application data is copied into the socket buffer. If there is insufficient space, the application will be put to sleep. From the socket buffer, TCP will create segments. No chunk exceeds the MSS.

Only after the data has been acknowledged by the peer instance can it be removed from the socket buffer! For slow connections or a slowly working peer, this implies that some data occupies the buffer for a very long time.

UDP
The socket buffer size of UDP is simply the maximum size of datagram UDP is able to transmit. Larger datagrams ought to elicit the EMSGSIZE error response from the socket layer. With UDP implementing an unreliable service, there is no need to keep the datagram in the socket buffer.

Please assume that there is not really a socket buffer for sending UDP. This really depends on the operating systems, but many systems copy the user data to some kernel storage area, whereas others try to eliminate all copy operations for the sake of performance.

Please note that for the reverse direction, that is receiving datagrams, UDP does indeed employ real buffering.

The IP layer needs to fragment chunks which are too large. Among the reasons TCP prechunks its segments is the need to avoid fragmentation. IP searches the routing tables for the appropriate interface in order to determine the fragment size and interface.

If the output queue of the datalink layer interface is full, the datagram will be discarded and an error will be returned to IP and back to the transport layer. If the transport protocol was TCP, TCP will try to resend the segment at a later time. UDP should return the ENOBUFS error, but some implementations don't.

To determine the MTU sizes, use the ifconfig -a command. The MTUs are needed for some calculations to be done later in this section. With IPv4 you can determine the MSS from the interface MTU by subtracting 20 bytes for the TCP header and 20 bytes for the IP header. Keep this in mind, as the calculation will be needed repeatedly in the text below.

$ ifconfig -a
lo0: flags=849<UP,LOOPBACK,RUNNING,MULTICAST> mtu 8232
        inet 127.0.0.1 netmask ff000000 
hme0: flags=863<UP,BROADCAST,NOTRAILERS,RUNNING,MULTICAST> mtu 1500
        inet 130.75.3.xxx netmask ffffff80 broadcast 130.75.3.255
ci0: flags=843<UP,BROADCAST,RUNNING,MULTICAST> mtu 9180
        inet 130.75.214.xxx netmask ffffff00 broadcast 130.75.214.255
        ether xx:xx:xx:xx:xx:xx
fa0: flags=842<BROADCAST,RUNNING,MULTICAST> mtu 9188
        inet 0.0.0.0 netmask 0 
        ether xx:xx:xx:xx:xx:xx
el0: flags=843<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 130.75.215.xxx netmask ffffff00 broadcast 130.75.215.255
        ether xx:xx:xx:xx:xx:xx

I removed the uninteresting things. hme0 is the regular 100 Mbps ethernet interface. The 10 Mbps ethernet interface is called le0. The el0 interface is an ATM LAN emulation (lane) interface. ci0 is the ATM classical IP (clip) interface. fa0 is the interface that supports Fore's proprietary implementation of native ATM. Fore is the vendor of the installed ATM card. AFAIK you can use this interface to build PVCs or, if you are also using Fore switches, SVCs. You see an unconfigured interface there.
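The MSS arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the helper name mss_from_mtu is made up for this example, and it assumes plain IPv4 and TCP headers of 20 bytes each, without any options.

```python
# Header sizes without options, as assumed in the text above.
IPV4_HEADER = 20
TCP_HEADER = 20


def mss_from_mtu(mtu):
    """Return the IPv4 TCP MSS for an interface MTU (illustrative helper)."""
    return mtu - IPV4_HEADER - TCP_HEADER


# MTU values taken from the ifconfig output above.
for name, mtu in (("lo0", 8232), ("hme0", 1500), ("ci0", 9180)):
    print(name, mtu, mss_from_mtu(mtu))
```

The resulting values for lo0, hme0 and ci0 are the segment sizes that show up again in the window calculations further below.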

The buffer sizes for sending and receiving TCP segments and UDP datagrams can be tuned with Solaris. With the help of the netstat command you can obtain output similar to the following. The data was obtained on a server which runs a Squid with five dnsserver children. Since the interprocess communication is accomplished via localhost sockets, you see both the client side and the server side of each dnsserver child socket.

$ netstat -f inet

 TCP
   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q  State
-------------------- -------------------- ----- ------ ----- ------ -------
blau-clip.ssh        challenger-clip.1023 57344     19 63980      0 ESTABLISHED
localhost.38437      localhost.38436      57344      0 57344      0 ESTABLISHED
localhost.38436      localhost.38437      57344      0 57344      0 ESTABLISHED
localhost.38439      localhost.38438      57344      0 57344      0 ESTABLISHED
localhost.38438      localhost.38439      57344      0 57344      0 ESTABLISHED
localhost.38441      localhost.38440      57344      0 57344      0 ESTABLISHED
localhost.38440      localhost.38441      57344      0 57344      0 ESTABLISHED
localhost.38443      localhost.38442      57344      0 57344      0 ESTABLISHED
localhost.38442      localhost.38443      57344      0 57344      0 ESTABLISHED
localhost.38445      localhost.38444      57344      0 57344      0 ESTABLISHED
localhost.38444      localhost.38445      57344      0 57344      0 ESTABLISHED

The columns titled with Swind and Rwind contain values for the size of the respective send- and reception windows, based on the free space available in the receive buffer at each peer. The Swind column contains the offered window size as reported by the remote peer. The Rwind column displays the advertised window size being transmitted to the remote peer.

An application can change the size of the socket layer buffers with calls to setsockopt with the parameter SO_SNDBUF or SO_RCVBUF. Windows and buffers are not interchangeable. Just remember: The buffers have a fixed size, unless you use setsockopt to change them. Windows on the other hand depend on the free space available in the input buffer. The minimum and maximum requirements for buffer sizes are tunable watermarks.
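As a rough illustration of the setsockopt calls mentioned above, here is a minimal Python sketch. The function name request_buffers and the size of 32768 are assumptions chosen for the example. Mind that the kernel is free to adjust the requested size (some systems round it or even double it), so the sketch reads the granted values back instead of assuming them.

```python
import socket


def request_buffers(sock, snd_bytes, rcv_bytes):
    """Ask the kernel for new socket buffer sizes and return what it granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, snd_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcv_bytes)
    granted_snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    granted_rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return granted_snd, granted_rcv


# Set the buffers before connecting, so they are in place for the handshake.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    print(request_buffers(s, 32768, 32768))
```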

buffers, watermarks and windows

Figure 5: buffers, watermarks and window sizes.

Figure 5 shows the relation of the different buffers, windows and watermarks. I decided to let the send buffer grow from the maximum towards zero, which is just a way of showing things, and does probably not represent the real implementation. I left out the different socket options as the picture is confusing enough.

Squid users should note the following behavior seen with Solaris 2.6. The default socket buffer sizes which are detected during configuration phase are representative of the values for tcp_recv_hiwat, udp_recv_hiwat, tcp_xmit_hiwat and udp_xmit_hiwat. Also note that enabling the hit object feature still limits hit object size to 16384 bytes, regardless of what your system is able to achieve.

The following is the output from the Squid 1.1.19 configuration script on a Solaris 2.6 host with the previously mentioned parameters all set to 64000. Please mind that these parameters do not constitute optimal sizes in most environments:

checking Default UDP send buffer size... 64000
checking Default UDP receive buffer size... 64000
checking Default TCP send buffer size... 64000
checking Default TCP receive buffer size... 64000

Buffers and windows are very important if you link via satellite. Given the data rates possible and the extremely high round-trip delays of a satellite link, you will need very large TCP windows and possibly the TCP timestamp option. Only RFC 1323 conformant systems will achieve these ends. In other words, get a Solaris 2.6. For 2.5 systems, RFC 1323 compliance can be purchased as a Sun Consulting Special.

Window sizes are important for maximum throughput calculations, too. As Stevens [4] shows, you cannot go faster than the window size offered by your peer, divided by the round-trip time (RTT). The lower your RTT, the faster you can transmit. The larger your window, the faster you can transmit. If you intend to employ maximum window sizes, you might want to give tcp_deferred_acks_max another look.

The network research laboratory of the German research network did measurements on satellite links. The RTT for a 10 Mbps link (if I remember correctly) was about 500 ms. A regular system was able to transmit 600 kbps whereas an RFC 1323 conformant system was able to transmit about 7 Mbps. Only bulk data transfer will do that for you.

 (1)   10 Mbps * 0.5 s = 5 Mbit = 625 KB
 (2)   512 KB / 0.5 s = 1 MBps = 8 Mbps
 (3)   64 KB / 0.5 s = 128 KBps = 1 Mbps

The bandwidth-delay-product can be used to estimate the initial value when tweaking buffer sizes. The buffers then represent the capacity of the link. If we apply the bandwidth-delay-product calculations to the satellite link above, we get the following results: Equation 1 estimates the buffer sizes necessary to fully fill the 10 Mbps link. Equation 2 assumes that the buffer sizes were set to 512 KB, which would yield 8 Mbps. The slight deviation in the experiment (about 7 Mbps measured) may have been caused by retransmissions. Finally, equation 3 estimates the maximum data rate we can use on the satellite link, if limited to 64 KB buffers, e.g. Solaris <= 2.5.1. The 1 Mbps constitutes an upper limit, as can be seen by the measured 600 kbps.
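The three equations can be reproduced with a small Python sketch. The function names are invented for this example. Equation 1 uses decimal megabits, while equations 2 and 3 are assumed to use 1 KB = 1024 bytes, which is why the results come out as roughly 8 Mbps and 1 Mbps rather than exact round numbers.

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product in bytes: how much data fits 'into the pipe'."""
    return bandwidth_bps * rtt_s / 8


def max_throughput_bps(window_bytes, rtt_s):
    """Upper bound on throughput in bits per second for a window and an RTT."""
    return window_bytes * 8 / rtt_s


RTT = 0.5  # seconds, the satellite link from the text

print(bdp_bytes(10_000_000, RTT))            # equation 1, in bytes
print(max_throughput_bps(512 * 1024, RTT))   # equation 2, in bits per second
print(max_throughput_bps(64 * 1024, RTT))    # equation 3, in bits per second
```

These are upper bounds only. Retransmissions, slow start and protocol overhead all push the achievable rate below them.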

Application developers, especially those for web-based applications, should be aware of the implications of persistent connections. As long as HTTP/1.0 connection-per-transaction style is used by your application, depending on the size of the transaction data, you will not get any decent transmissions via satellite. For instance, the average web object is about 13 KByte in size, thus transmitting such an object on a connection-per-transaction basis will never get past TCP slow start. While this may or may not be a big deal with terrestrial links, you will never be able to fill a satellite pipe to a satisfactory degree. Doing things in parallel might help. Only when reaching TCP congestion avoidance will you see any filling of the pipe. You might also want to check out the related tcp_slow_start_initial parameter.
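To get a feeling for why a 13 KByte object never leaves slow start, consider the following deliberately idealized Python sketch. It assumes an initial congestion window of one segment, a window that doubles every round trip, no losses and no delayed acknowledgements, and it ignores the connection handshake. Real stacks deviate from this model, so treat the result as an order of magnitude only. The function name is made up for this example.

```python
import math


def slow_start_round_trips(object_bytes, mss, initial_cwnd=1):
    """Round trips needed to send object_bytes under idealized slow start."""
    segments_left = math.ceil(object_bytes / mss)
    cwnd = initial_cwnd
    round_trips = 0
    while segments_left > 0:
        segments_left -= cwnd   # one full window of segments per round trip
        cwnd *= 2               # the window doubles while in slow start
        round_trips += 1
    return round_trips


# Average web object of 13 KByte over an ethernet MSS of 1460 bytes.
print(slow_start_round_trips(13 * 1024, 1460))
```

Multiplied by a satellite RTT of 0.5 s, even this optimistic model spends seconds on a single small object, with the pipe mostly empty the whole time.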

A word of caution seems to be in order when tuning the Solaris TCP high watermarks: Starting with Solaris 2.6, setting tcp_xmit_hiwat or tcp_recv_hiwat near 65535 may have the side effect of turning on the wscale option, because these values are rounded up to multiples of the MSS for each connection. In some cases you may not want to accidentally use wscale, because it may break something else in your setup such as IP-Filter. To avoid accidentally using wscale, you need to make sure that tcp_xmit_hiwat and tcp_recv_hiwat are both at least 1 MSS below 65535. For ethernet interfaces, 64000 is a good choice.
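The following Python sketch illustrates the rounding effect. It assumes that the high watermark is rounded up to the next multiple of the MSS, as described for tcp_recv_hiwat, and that window scaling kicks in as soon as the rounded value exceeds 65535. The function names are invented for the example and do not correspond to anything in the Solaris kernel.

```python
import math

MAX_UNSCALED_WINDOW = 65535  # largest window expressible without wscale


def rounded_hiwat(hiwat, mss):
    """Round a high watermark up to the next multiple of the MSS."""
    return math.ceil(hiwat / mss) * mss


def needs_wscale(hiwat, mss):
    """True if the rounded watermark no longer fits into 16 bits."""
    return rounded_hiwat(hiwat, mss) > MAX_UNSCALED_WINDOW


# Ethernet MSS of 1460 bytes.
for hiwat in (64000, 65000, 65535):
    print(hiwat, rounded_hiwat(hiwat, 1460), needs_wscale(hiwat, 1460))
```

With an ethernet MSS, 64000 stays below the limit after rounding, while 65000 already crosses it.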

tcp_cwnd_max
default 32768, since 2.? 65535, recommended 65535 for Solaris <= 2.5.1
since 2.6: 262144 (finally!), no recommendations
Since 8: 1048576, no recommendations

This parameter describes the maximum size to which the congestion window can be opened. The congestion window is opened as large as possible with any Solaris up to 2.5.1. A change to this value is only necessary for older Solaris systems, which defaulted to 32768. The Solaris 2.6 default looks reasonable, but you might need to increase this further for satellite or long, fast links.

Though window sizes beyond 64k are possible, mind that the window scale option is only announced during connection creation and your maximum window size is 1 GByte (1,073,725,440 bytes). Also, the window scale option is only employed during the connection if both sides support it.
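The 1 GByte figure follows from the 16-bit window field combined with the largest shift count of 14 that RFC 1323 permits. A short Python sketch of that arithmetic, with a helper name made up for illustration:

```python
MAX_WINDOW_FIELD = 65535   # largest value of the 16-bit TCP window field
MAX_SHIFT = 14             # largest window scale shift count in RFC 1323


def scaled_window(window_field, shift):
    """Effective window in bytes for a window field and a scale shift."""
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift count out of range")
    return window_field << shift


print(scaled_window(MAX_WINDOW_FIELD, MAX_SHIFT))
```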

tcp_recv_hiwat
default 8192, recommended 16384 (see text), Cockroft 32768, maximum 65535
Solaris 2.6 LFN bulk data transfer 131071 or above (see text)
Since 8: 24576 (see text)

This parameter determines the maximum size of the initial TCP reception buffer. The specified value will be rounded up to the next multiple of the MSS. From the free space within the buffer the advertised window size is determined. That is, the size of the reception window advertised to the remote peer. Squid users will be interested in this value with regards to the socket buffer size the Squid auto configuration program finds.

The previous table shows an Rwind value of 63980 = 7 * 9140. 9140 is the MSS of the ATM classical IP interface (clip) in host blau. The interface itself uses an MTU of 9180. For the standard builtin 10 Mbps or 100 Mbps IPX ethernet, you get an MTU of 1500 on the outgoing interface, which yields an MSS of 1460. The value of 57344 in the next Rwind line points to the lo0 (loopback) interface, MTU 8232, MSS 8192 and 57344 = 7 * 8192.

Starting with Solaris 2.6, values above 65535 are possible, see the window scale option from RFC 1323. You will benefit from buffer sizes above 65535 only if the peer host also implements RFC 1323. If one host does not implement the window scale option, the window is still limited to 64K. The option is only activated if buffer sizes above 64K are used.

For HTTP, I don't see the need to increase the buffer above 64k. Imagine servicing 1024 simultaneous connections. If both the TCP high watermarks of your system are tuned to 64k and your application uses the system's defaults, you would need 128M just for your TCP buffers!
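The memory estimate is simple enough to sketch in Python. It assumes the worst case in which every connection fills both its send and its receive buffer completely, which real traffic rarely does. The function name is hypothetical.

```python
def tcp_buffer_memory(connections, xmit_hiwat, recv_hiwat):
    """Worst-case bytes held in socket buffers if every buffer is full."""
    return connections * (xmit_hiwat + recv_hiwat)


KB = 1024
for hiwat in (16 * KB, 32 * KB, 64 * KB):
    total = tcp_buffer_memory(1024, hiwat, hiwat)
    print(hiwat // KB, "K per buffer ->", total // (KB * KB), "M total")
```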

Squid's configuration option tcp_recv_bufsize lets you select a TCP receive buffer size, but if set to 0 (default) the kernel value will be taken, which is configurable with the tcp_recv_hiwat parameter. A buffer size of 16K is large enough to cover over 70 % of all received web objects on our caches.

Refer to tcp_host_param for a way to configure special defaults for a set of hosts and networks.

tcp_recv_hiwat_minmss
default 4, no recommendations

This parameter influences the minimum size of the input buffer. The reception buffer is at least as large as this value multiplied by the MSS. The real value is the maximum of tcp_recv_hiwat rounded up to the next MSS and tcp_recv_hiwat_minmss multiplied by the MSS, in other words, something akin to:

  hiwat_tmp ~= ceil( tcp_recv_hiwat / MSS )
  real_size := MAX( hiwat_tmp, tcp_recv_hiwat_minmss ) * MSS

That way, however badly you misconfigure the buffers, there is a guaranteed space for tcp_recv_hiwat_minmss full segments in your input buffer.
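The pseudo code above translates directly into Python. This is a sketch of the described calculation, not of the actual kernel code, and the function name is chosen for illustration only.

```python
import math


def effective_recv_buffer(tcp_recv_hiwat, mss, tcp_recv_hiwat_minmss=4):
    """Receive buffer size in bytes after rounding and the minimum-MSS rule."""
    hiwat_segments = math.ceil(tcp_recv_hiwat / mss)
    return max(hiwat_segments, tcp_recv_hiwat_minmss) * mss


# The default watermark on ethernet, then a badly misconfigured tiny watermark.
print(effective_recv_buffer(8192, 1460))
print(effective_recv_buffer(1000, 1460))
```

In the second call the watermark is smaller than a single segment, yet the buffer still holds four full segments.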

udp_recv_hiwat
default 8192, recommended 16384 (see text), maximum 65535

The highwater mark for the UDP reception buffer size. This value may be of interest for Squid proxies which use ICP extensively. Please read the explanations for tcp_recv_hiwat. Squid users will want at least 16384, especially if you are planning on using the (obsolete) hit object feature of Squid. A larger value lets your computer receive more seemingly simultaneous ICP PDUs.

If you see many dead parent detections in your cache.log file without cause, you might want to increase the receive buffer. In most environments an increase to 64000 will have a negligible effect on the memory consumption, as most applications, including Squid, use only one or very few UDP sockets, and often in an iterative way.

Remember that if you don't set your socket buffer explicitly with a call to setsockopt(), your default reception buffer will have about the mentioned size. Arriving datagrams of a larger size might be truncated or completely rejected. Some systems don't even notify your receiving application.

tcp_xmit_hiwat
default 8192, recommended 16384 (see text), Cockroft 32768, maximum 65535
Solaris 2.6 LFN bulk data transfer 131071 or above (see text)
Since 8: 16384 (see text)

This parameter influences a heuristic which determines the size of the initial send window. The actual value will be rounded up to the next multiple of the MSS, e.g. 8760 = 6 * 1460. Also do read the section on tcp_recv_hiwat.

The table further to the top shows a Swind of 57344 = 7 * 8192. For the standard builtin 10 Mbps or 100 Mbps IPX ethernet, you get an MTU of 1500 on the outgoing interface, which yields an MSS of 1460.

Starting with Solaris 2.6, values above 65535 are possible, see the window scale option from RFC 1323. You will benefit from buffer sizes above 65535 only if the peer host also implements RFC 1323. If one host does not implement the window scale option, the window is still limited to 64K.

I don't see the need to increase the buffer above 32K for HTTP applications. Imagine servicing 1024 simultaneous connections. If both TCP high watermarks of your system are tuned to 32K, you would need 64M just for your TCP buffers. Mind that the send buffer has to keep a copy of all unacknowledged segments. Therefore it is affordable to give it a greater size than the receive buffer. Again, 16K covers over 70 % of all transferred web objects on our caches, and 32K should cover 90 %.

Refer to tcp_host_param for a way to configure special defaults for a set of hosts and networks.

udp_xmit_hiwat
default 8192, recommended 16384, maximum 65535

This refers to the highwater mark for send buffers. May be of interest for proxies using ICP extensively. Please refer to the explanations for tcp_xmit_hiwat. Squid users will want 16384, especially if you are planning on using the hit object feature of Squid. Selecting a higher value for the transmission is not feasible.

Please remember that there exists no real send buffer for UDP on the socket layer. Thus, trying to send a larger amount of data than udp_xmit_hiwat will truncate the excess, unless the SO_SNDBUF socket option was used to extend the allowed size.

tcp_xmit_lowat
default 2048, no recommendations
Since 8: 4096, no recommendations

The current parameter refers to the amount of data which must be available in the TCP socket sendbuffer until select or poll return writable for the connected file descriptor.

Usually there is no need to tune this parameter. Applications can use the socket option SO_SNDLOWAT to change this parameter on a process local basis.

udp_xmit_lowat
default 1024, no recommendations

The current parameter refers to the amount of data which must be available until select or poll return writable for the connected file descriptor. Since UDP does not need to keep datagrams and thus needs no outgoing socket buffer, the socket will always be writable as long as the socket sendbuffer size value is greater than the low watermark. Thus it does not really make much sense to wait for a datagram socket to become writable unless you constantly adjust the sendbuffer size.

Usually there is no need to tune this parameter, especially not on a system-wide basis.

tcp_max_buf
default 262144, minimum 65536, no immediate recommendations
since 2.6 1048576, minimum 65536, no immediate recommendations
udp_max_buf
default 262144 (since 2.5), minimum 65536, no immediate recommendations

I finally found the explanations in the SUN TCP/IP Admin Guide. The current parameter refers to the maximum buffer size an application is allowed to specify with the SO_SNDBUF and SO_RCVBUF socket option calls. Attempts to use larger buffers will fail with an EINVAL return code from the socket option call. SUN recommends using only the largest buffer necessary for any of your applications, that is, the supremum function, not the sum. Specifying a greater size does not seem to have much impact, if all your applications are well-behaved. If not, they may consume quite an amount of kernel memory, thus this parameter is also a kind of safety line.

A few odd remarks at this point, concerning the recommendations given for the transmission buffer sizes. I decreased the recommendations of Adrian Cockroft in favor of a more conservative memory consumption. Also, with an average HTTP object size of 13 KByte, you can expect to fit over 50 % of all objects into the transmission buffer. On the other hand, larger objects which are to be transmitted by a cache or webserver may suffer in certain circumstances. Furthermore, I should recommend a generic transmission buffer size which is double the reception buffer size. This recommendation is based on the fact that unacknowledged segments occupy the send buffer until they are acknowledged.

Here some more material from the SUN TCP/IP Admin Guide, kindly pointed out by Mr. Murphy. Refer to the SUN guide for a more detailed description of these parameters, and their respective applicability. Most noteworthy is tcp_host_param, which allows per host/network defaults regarding RFC 1323 TCP options.

tcp_wscale_always
Since 2.6: default 0

If the parameter is set (non-zero), then the TCP window scale option will always be negotiated during connection initiation. Otherwise, the scale option will only be used if the buffer size is above 64K. To take effect, both hosts have to support RFC 1323.

tcp_tstamp_always
Since 2.6: default 0

If the parameter is set (non-zero), then the TCP timestamp option will always be negotiated during connection initiation. The timestamp option will always be used if the remote system sent a timestamp option during connection initiation. To use the timestamp, both hosts have to support RFC 1323.

tcp_tstamp_if_wscale
Since 2.6: default 0

If the option is set (non-zero), the TCP timestamp option will be used in addition to the TCP window scale option, if the user has requested a buffer size above 64K, that is, if window scaling is active.

tcp_host_param_ipv6
Since 8: default is empty (this is a tabular value)

Refer to tcp_host_param for instructions on handling the table. The same rules apply except that the ipv6 table is meant for IPv6, of course.

tcp_host_param
Since 2.6: default is empty (this is a tabular value)

This parameter represents a table which contains special TCP options to be used with a remote host or network. The table is configurable with the help of ndd, and empty by default. The following piece of code displays the contents of the table at various points, sets an entry and removes it again:

# ndd /dev/tcp tcp_host_param
Hash HSP      Address         Subnet Mask     Send       Receive    TStamp

# ndd -set /dev/tcp tcp_host_param '192.168.4.17 sendspace 262144 recvspace 262144'
# ndd /dev/tcp tcp_host_param
Hash HSP      Address         Subnet Mask     Send       Receive    TStamp
 125 62bae844 192.168.004.017 000.000.000.000 0000262144 0000262144      0

# ndd -set /dev/tcp tcp_host_param '192.168.4.17 delete'
# ndd /dev/tcp tcp_host_param
Hash HSP      Address         Subnet Mask     Send       Receive    TStamp

Use the mask command to supply a netmask for a network, and the timestamp command to supply the timestamp option. Fill this table from a startup script, if you want large default windows only for certain links (e.g. which go via satellite), but small windows for anything else. The content of this table takes precedence over the generic global values, if certain criteria are met:

7. Tuning your system

This section revolves around tuning items which are not directly related to the TCP/IP stack, but nevertheless play an important role in the tuning of any system. Refer to SUN's Solaris Tunable Parameters Reference Manual for more in-depth information.

7.1 Things to watch

Did you reserve enough swap space? You should have at least as much swap as you have main memory. If you have little main memory, even double your swap. Do not be fooled by the result of the vmstat command - read the manpage and realize that the small value for free memory shown there is (usually) correct.

With Solaris there seems to exist a difference between virtually generated processes and real processes. The latter is extremely dependent on the amount of virtual memory. To test the amount of both kinds of processes, try a small program of mine. Do start it at the console, without X and not as privileged user. The first value is the hard limit of processes, and the second value the amount of processes you can really create given your virtual memory configuration. Tweaking your ulimit values may or may not help.

7.2 General entries in the file /etc/system

The file /etc/system contains various very important configurable resource parameters for your system. You use these tunings to give a heavily loaded system more resources of a certain kind. Unfortunately a reboot is necessary after changing anything. Though one could schedule reboots after midnight, I advise against it. You should always check if your changes have the desired effect, and won't tear down the system.

Adrian Cockroft severely warns against transporting an /etc/system from one system onto another, even worse, onto another hardware platform:

Clean out your /etc/system when you upgrade.

The most frequent changes are limited to the number of file descriptors, because the socket API uses file descriptors for handling internet connectivity. You may want to look at the hard limit of file handles available to you. Proxies like Squid have to count two to three descriptors for each request: an open request descriptor, plus an open file and/or (depending on which Squid you are using) an open forwarding request descriptor. Similar calculations are true for other caches.

You are able to influence the tuning with the reserved word set. Use a whitespace to separate the key from the keyword. Use an equals sign to separate the value from its key. There are a few examples in the comments of the file.

Please, before you start, make a backup copy of your initial /etc/system. The backup should be located on your root filesystem. Thus, if some parameters fail, you can always supply the alternative, original system file on the boot prompt. The following shows two typically entered parameters:

* these are the defaults of Solaris < 8
set rlim_fd_max=1024
set rlim_fd_cur=64

WARNING! SUN does not make any guarantees for the correct working of your system, if you use more file descriptors than 4096. Personally, my old fvwm window manager quit working altogether. In my case, I had compiled it on a Solaris 2.3 or 2.4 system and always carried it onwards to a 2.5 system. After re-compiling it on the new OS, it worked to my satisfaction.

If you experience SEGV core dumps from your select(3c) system call after increasing your file descriptors above 4096, you have to recompile the affected programs. Especially the select(3c) call is known to the Squid users for its bad tempers concerning the maximum number of file descriptors. SUN remarks to this topic:

The default value for FD_SETSIZE (currently 1024) is larger than the default limit on the number of open files. In order to accommodate programs that may use a larger number of open files with select(), it is possible to increase this size within a program by providing a larger definition of FD_SETSIZE before the inclusion of <sys/types.h>.

Note: This does not work as expected! See text below.

I did test this suggestion by SUN, and a friend of mine tried it with Squid Caches. The result was a complete success or disaster both times, depending on your point of view: If you can live with supplying naked women to your customers instead of bouncing logos of companies, go ahead and try it. If you really need to access file descriptors above 1024, don't use select(), use poll() instead! poll() is supposed to be faster with Solaris, anyway. A different source mentions that the redefinition workaround mentioned above works satisfactorily; not for me, my personal experiences warn against such an action.
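For readers unfamiliar with poll(), here is a minimal sketch of the idea. It uses Python's select.poll wrapper instead of C purely for brevity, and it assumes a Unix-like system where that wrapper is available. Unlike select(), poll() takes a list of descriptors instead of a fixed-size bit mask, so it is not bound by FD_SETSIZE. The helper name wait_readable is made up for this example.

```python
import select
import socket


def wait_readable(sock, timeout_ms):
    """Return True if sock becomes readable within timeout_ms milliseconds."""
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    events = poller.poll(timeout_ms)
    return any(flags & select.POLLIN for _fd, flags in events)


a, b = socket.socketpair()
try:
    print(wait_readable(b, 100))    # nothing has been sent yet
    a.sendall(b"ping")
    print(wait_readable(b, 1000))   # data is now pending on b
finally:
    a.close()
    b.close()
```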

The pages of VJ contain some tricks which I incorporated into this paper, too. Personally I am of the opinion that the VJ pages are not as up to date as they could be.

Many parameters of interest can be determined using the sysdef -i command. Please keep in mind that many values are in hexadecimal notation without the 0x prefix. Another very good program to see your system's configuration is sysinfo. Refer to the manpages for how to invoke this program.

There is also the possibility to use a small helper script kindly supplied by Mr. Kroonma to have a look into some kernel variables with the help of the absolute debugger (adb). You can extend the script to suit your own needs, but you should know what you are doing. Refer to the manual page of the absolute debugger for details of displaying non-ulong datatype variables. If you don't know what adb can do for you, hands off.

rlim_fd_cur
default 64, recommended 64 or 256
Since 8: default 256, no recommendations

This parameter defines the soft limit of open files you can have. The currently active soft limit can be determined from a shell with something like

ulimit -Sn

Use values above 256 at your own risk, especially if you are running old binaries. A value of 4096 may look harmless enough, but may still break old binaries.

Another source mentions that using more than 8192 file descriptors is discouraged. It mentions that you ought to use more processes, if you need more than 4096 file descriptors. On the other hand, an ISP of my acquaintance is using 16384 descriptors to his satisfaction.

The predicate rlim_fd_cur <= rlim_fd_max must be fulfilled.

Please note that Squid only cares about the hard limit (next item). With respect to the standard IO library, you should not raise the soft limit above 256. Stdio can only use <= 256 FDs. You can either use AT&T's sfio library, or use Solaris 64-bit mode applications which fix the stdio weakness.

Also note that RPC prior to Solaris 2.6 may break, if more than 1024 FDs are available to it. Also, setting the soft limit to or above 1024 implies that your license server queries break (first hand experience - thanks Jens). Using 256 is really a strong recommendation.

rlim_fd_max
default 1024, recommended >=4096

This parameter defines the hard limit of open files you can have. For a Squid and most other servers, regardless of TCP or UDP, the number of open file descriptors per user process is among the most important parameters. The number of file descriptors is one limit on the number of connections you can have in parallel. You can find out the value of your hard limit on a shell with something like

ulimit -Hn

You should consider a value of at least 2 * tcp_conn_req_max and you should provide at least 2 * rlim_fd_cur. The predicate rlim_fd_cur <= rlim_fd_max must be fulfilled.

Use values above 1024 at your own risk. SUN does not make any warranty for the workability of your system, if you increase this above 1024. Squid users of busy proxies will have to increase this value, though. A good starting point seems to be 16384 <= x <= 32768. Remember to change the Makefile for Squid to use poll() instead of select(). Also remember that each call of configure will change the Makefile back, if you didn't change Makefile.in.

Any decent application will incorporate code to increase its soft limit to a possibly higher hard limit. Please note (again) that Squid, as such an application, only cares about the hard limit.
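What such code might look like is sketched below in Python, using the standard resource module. This is an illustration of the general technique, not the code Squid uses. Some systems refuse certain values, which is why the sketch silently keeps the old soft limit if the request fails. The function name is an assumption of this example.

```python
import resource


def raise_nofile_soft_limit():
    """Raise the soft file descriptor limit to the hard limit, if possible."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            pass  # some systems refuse the request; keep the current limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)


# Prints the (soft, hard) pair in effect after the attempt.
print(raise_nofile_soft_limit())
```

The hard limit itself cannot be raised this way by an unprivileged process. That is what rlim_fd_max is for.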

maxphys
default 126976 (sun4m and sun4d), 131072 (sun4u), 57344 (Intel),
1048576 (sd driver with wide-SCSI), 1048576 (SPARC storage array driver), no recommendations

A work-copy of this value is often stored in the mount structure or driver structure at the time it is attached. If a driver sees IO requests larger than this parameter, the requests will be broken down into appropriately sized chunks. The file system may further fragment the chunks. A change might be conceivable, if your database server uses raw devices and issues large requests. Mind that many of today's database usage paradigms result in many small chunked requests and will not speed up by increasing this value.

If you work with large chunked IO on UFS, you can additionally increase the number of cylinder groups and decrease the number of inodes per group (as there will be only a few large files).

maxusers
default 249 ~= Megs RAM (Ultra-2/2 CPUs/256 MB), min 8, max 2048, no recommendations

This parameter determines the size of certain kernel data structures which are initialized at startup. Recent versions of Solaris derive most table sizes from the amount of memory available, but some variables still depend on this parameter, see max_nprocs, maxuprc, ufs_ninode, ncsize and ndquot. There is strong indication that the default for maxusers itself is determined from the main memory in megabytes. It might also be a function of the available memory and/or architecture.

The greater the number you choose for maxusers, the greater the number of the mentioned resources. The relation is strictly proportional: A doubling of maxusers will (more or less) double the other resources.

Adrian Cockcroft advises against setting maxusers. The kernel uses a lot of space while keeping track of RAM usage within the system, so the value might need to be reduced on systems with gigabytes of main memory. The point at which to change this parameter is whenever the automagically determined number of user processes is way too high (e.g. file servers, database servers, compute servers with few processes) or way too low.

pidmax
Since 8: default 30000, minimum 266, maximum 999999, no recommendations

Starting with Solaris 8, you can set the largest possible value for a pid_t that the system can assign. The kernel variable maxpid is set from this parameter once during startup. maxpid itself, on the other hand, cannot be set via /etc/system.

reserved_procs
Since 8: default 5, minimum 5, maximum MAXINT, no recommendations

This parameter is the mysterious difference between the number of all processes max_nprocs and the number of user processes maxuprc, and affects the number of system process table slots reserved for uid 0, e.g. sched, pageout and fsflush.

Though a change is not urgently recommended, increasing the number of root slots to 10 plus the number of root processes might be considered, in order to provide root with a shell at times when the system is incapable of creating a user-level shell, e.g. runaway user processes, fork-of-death, etc.

max_nprocs
default 10+maxusers*16, minimum 266, maximum MIN(maxpid,65534), no recommendations

This is the system-wide number of processes available, counting both user and system processes. You should leave sufficient headroom above the parameter maxuprc. The value of this parameter is influenced by the setting of maxusers.

The number is used to compute various further parameters (see below), including the DNLC cache, the quota structures, System-V semaphore limits, and address translation table resources for sun4m, sun4d and Intel Solaris versions.
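The default formula quoted above is easy to play with. The sketch below only reproduces the arithmetic 10+maxusers*16 and compares the result with the documented minimum and maximum. The function names are made up for illustration, the maxpid default of 30000 is the Solaris 8 pidmax default mentioned earlier, and how the kernel treats a value outside the range is not modelled.

```python
def default_max_nprocs(maxusers):
    """Default from the formula quoted above: 10 + maxusers * 16."""
    return 10 + maxusers * 16


def in_documented_range(max_nprocs, maxpid=30000):
    """Check a value against minimum 266 and maximum MIN(maxpid, 65534)."""
    return 266 <= max_nprocs <= min(maxpid, 65534)


if __name__ == "__main__":
    for maxusers in (249, 498, 2048):
        value = default_max_nprocs(maxusers)
        print(maxusers, value, in_documented_range(value))
```

For the maxusers of 249 from the Ultra-2 example this yields 3994. Doubling maxusers to 498 gives 7978, which is the "more or less" doubling described under maxusers.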

maxuprc
default