diff options
author | Matt Turner <mattst88@gmail.com> | 2008-11-11 21:27:09 +0000 |
---|---|---|
committer | Matt Turner <mattst88@gmail.com> | 2008-11-11 21:27:09 +0000 |
commit | 91b4edf69e5adf9c40edcaff38b880218b7b0d9d (patch) | |
tree | 85b654192fa2ce167f4899f6f0a4ca14b1acdb13 /bdoc.txt |
Initial Import
git-svn-id: svn://mattst88.com/svn/cleanbench/trunk@1 0d43b9a7-5ab2-4d7b-af9d-f64450cef757
Diffstat (limited to 'bdoc.txt')
-rw-r--r-- | bdoc.txt | 2109 |
1 files changed, 2109 insertions, 0 deletions
diff --git a/bdoc.txt b/bdoc.txt new file mode 100644 index 0000000..e557bb0 --- /dev/null +++ b/bdoc.txt @@ -0,0 +1,2109 @@ +http://www.byte.com/bmark/bmark.htm +---------------------------------------------------------------------------- + +BYTEmark + +---------------------------------------------------------------------------- + +This is release 2 of BYTE Magazine's BYTEmark benchmark program (previously +known as BYTE's Native Mode Benchmarks). This document covers the Native +Mode (a.k.a. Algorithm Level) tests; benchmarks designed to expose the +capabilities of a system's CPU, FPU, and memory system. Another group of +benchmarks within the BYTEmark suite includes the Application Simulation +Benchmarks. They are detailed in a separate document. [NOTE: The +documentation for the Application simulation benchmarks should appear before +the end of March, 95. -- RG]. + +The Tests + +The Native Mode portion of the BYTEmark consists of a number of well-known +algorithms; some BYTE has used before in earlier versions of the benchmark, +others are new. The complete suite consists of 10 tests: + +Numeric sort - Sorts an array of 32-bit integers. + +String sort - Sorts an array of strings of arbitrary length. + +Bitfield - Executes a variety of bit manipulation functions. + +Emulated floating-point - A small software floating-point package. + +Fourier coefficients - A numerical analysis routine for calculating series +approximations of waveforms. + +Assignment algorithm - A well-known task allocation algorithm. + +Huffman compression - A well-known text and graphics compression algorithm. + +IDEA encryption - A relatively new block cipher algorithm. + +Neural Net - A small but functional back-propagation network simulator. + +LU Decomposition - A robust algorithm for solving linear equations. + +A more complete description of each test can be found in later sections of +this document. + +BYTE built the BYTEmark with the multiplatform world foremost in mind. There +were, of course, other considerations that we kept high on the list: + +Real-world algorithms. The algorithms should actually do something. Previous +benchmarks often moved gobs of bytes from one point to another, added or +subtracted piles and piles of numbers, or (in some cases) actually executed +NOP instructions. We should not belittle those tests of yesterday, they had +their place. However, we think it better that tests be based on activities +that are more complex in nature. + +Easy to port. All the benchmarks are written in "vanilla" ANSI C. This +provides us with the best chance of moving them quickly and accurately to +new processors and operating systems as they appear. It also simplifies +maintenance. + +This means that as new 64-bit (and, perhaps, 128-bit) processors appear, the +benchmarks can test them as soon as a compiler is available. + +Comprehensive. The algorithms were derived from a variety of sources. Some +are routines that BYTE had been using for some time. Others are routines +derived from well-known texts in the computer science world. Furthermore, +the algorithms differ in structure. Some simply "walk" sequentially through +one-dimensional arrays. Others build and manipulate two-dimensional arrays. +Finally, some benchmarks are "integer" tests, while others exercise the +floating-point coprocessor (if one is available). + +Scalable. We wanted these benchmarks to be useful across as wide a variety +of systems as possible. We also wanted to give them a lifetime beyond the +next wave of new processors. + +To that end, we incorporated "dynamic workload adjustment." A complete +description of this appears in a later section. In a nutshell, this allows +the tests to "expand or contract" depending on the capabilities of the +system under test, all the while providing consistent results so that fair +and accurate comparisons are possible. + +Honesty In Advertising + +We'd be lying if we said that the BYTEmark was all the benchmarking that +anyone would ever need to run on a system. It would be equally inaccurate to +suggest that the tests are completely free of inadequacies. There are many +things the tests do not do, there are shortcomings, and there are problems. + +BYTE will continue to improve the BYTEmark. The source code is freely +available, and we encourage vendors and users to examine the routines and +provide us with their feedback. In this way, we assure fairness, +comprehensiveness, and accuracy. + +Still, as we mentioned, there are some shortcomings. Here are those we +consider the most significant. Keep them in mind as you examine the results +of the benchmarks now and in the future. + +At the mercy of C compilers. Being written in ANSI C, the benchmark program +is highly portable. This is a reflection of the "world we live in." If this +were a one-processor world, we might stand a chance at hand-crafting a +benchmark in assembly language. (At one time, that's exactly what BYTE did.) +Not today, no way. + +The upshot is that the benchmarks must be compiled. For broadest coverage, +we selected ANSI C. And when they're compiled, the resulting executable's +performance can be highly dependent on the capabilities of the C compiler. +Today's benchmark results can be blown out of the water tomorrow if someone +new enters the scene with an optimizing strategy that outperforms existing +competition. + +This concern is not easily waved off. It will require you to keep careful +track of compiler version and optimization switches. As BYTE builds its +database of benchmark results, version number and switch setting will become +an integral part of that data. This will be true for published information +as well, so that you can make comparisons fairly and accurately. BYTE will +control the distribution of test results so that all relevant compiler +information is attached to the data. + +As a faint justification -- for those who think this situation results in +"polluted" tests -- we should point out that we are in the same boat as all +the other developers (at least, all those using C compilers -- and that's +quite a sizeable group). If the only C compilers for a given system happen +to be poor ones, everyone suffers. It's a fact that a given platform's +ultimate potential depends as much on the development software available as +on the technical achievements of the hardware design. + +It's just CPU and FPU. It's very tempting to try to capture the performance +of a machine in a single number. That has never been possible -- though it's +been tried a lot -- and the gap between that ideal and reality will forever +widen. + +These benchmarks are meant to expose the theoretical upper limit of the CPU, +FPU, and memory architecture of a system. They cannot measure video, disk, +or network throughput (those are the domains of a different set of +benchmarks). You should, therefore, use the results of these tests as part, +not all, of any evaluation of a system. + +Single threaded. Currently, each benchmark test uses only a single execution +thread. It's unlikely that you'll find any modern operating system that does +not have some multitasking component. How a system "scales" as more tasks +are run simultaneously is an effect that the current benchmarks cannot +explore. + +BYTE is working on a future version of the tests that will solve this +problem. + +The tests are synthetic. This quite reasonable argument is based on the fact +that people don't run benchmarks for a living, they run applications. +Consequently, the only true measure of a system is how well it performs +whatever applications you will be running. This, in fact, is the philosophy +behind the BAPCo benchmarks. + +This is not a point with which we would disagree. BYTE regularly makes use +of a variety of application benchmarks. None of this suggests, however, that +the BYTEmark benchmarks serve no purpose. + +BYTEmark's results should be used as predictors. They can be moved to a new +platform long before native applications will be ported. The BYTEmark +benchmarks will therefore provide an early look at the potential of the +machine. Additionally, the BYTEmark permits you to "home in" on an aspect of +the overall architecture. How well does the system perform when executing +floating-point computations? Does its memory architecture help or hinder the +management of memory buffers that may fall on arbitrary address boundaries? +How does the cache work with a program whose memory access favors moving +randomly through memory as opposed to moving sequentially through memory? + +The answers to these questions can give you a good idea of how well a system +would support a particular class of applications. Only a synthetic benchmark +can give the narrow view necessary to find the answers. + +Dynamic Workloads + +Our long history of benchmarking has taught us one thing above all others: +Tomorrow's system will go faster than today's by an amount exceeding your +wildest guess -- and then some. Dealing with this can become an unending +race. + +It goes like this: You design a benchmark algorithm, you specify its +parameters (how big the array is, how many loops, etc.), you run it on +today's latest super-microcomputer, collect your data, and go home. A new +machine arrives the next day, you run your benchmark, and discover that the +test executes so quickly that the resolution of the clock routine you're +using can't keep up with it (i.e., the test is over and done before the +system clock even has a chance to tick). + +If you modify your routine, the figures you collected yesterday are no good. +If you create a better clock routine by sneaking down into the system +hardware, you can kiss portability goodbye. + +The BYTEmark benchmarks solve this problem by a process we'll refer to as +"dynamic workload adjustment." In principle, it simply means that if the +test runs so fast that the system clock can't time it, the benchmark +increases the test workload -- and keeps increasing it -- until enough time +is consumed to gather reliable test results. + +Here's an example. + +The BYTEmark benchmarks perform timing using a "stopwatch" paradigm. The +routine StartStopwatch() begins timing; StopStopwatch() ends timing and +reports the elapsed time in clock ticks. Now, "clock ticks" is a value that +varies from system to system. We'll presume that our test system provides +1000 clock ticks per second. (We'll also presume that the system actually +updates its clock 1000 times per second. Surprisingly, some systems don't do +that. One we know of will tell you that the clock provides 100 ticks per +second, but updates the clock in 5- or 6-tick increments. The resolution is +no better than somewhere around 1/18th of a second.) Here, when we say +"system" we mean not only the computer system, but the environment provided +by the C compiler. Interestingly, different C compilers for the same system +will report different clock ticks per second. + +Built into the benchmarks is a global variable called GLOBALMINTICKS. This +variable is the minimum number of clock ticks that the benchmark will allow +StopStopwatch() to report. + +Suppose you run the Numeric Sort benchmark. The benchmark program will +construct an array filled with random numbers, call StartStopwatch(), sort +the array, and call StopStopwatch(). If the time reported in StopStopwatch() +is less than GLOBALMINTICKS, then the benchmark will build two arrays, and +try again. If sorting two arrays took less time than GLOBALMINTICKS, the +process repeats with more arrays. + +This goes on until the benchmark makes enough work so that an interval +between StartStopwatch() and StopStopwatch() exceeds GLOBALMINTICKS. Once +that happens, the test is actually run, and scores are calculated. + +Notice that the benchmark didn't make bigger arrays, it made more arrays. +That's because the time taken by the sort test does not increase linearly as +the array grows, it increases by a factor of N*log(N) (where N is the size +of the array). + +This principle is applied to all the benchmark tests. A machine with a less +accurate clock may be forced to sort more arrays at a time, but the results +are given in arrays per second. In this way fast machines, slow machines, +machines with accurate clocks, machines with less accurate clocks, can all +be tested with the same code. + +Confidence Intervals + +Another built-in feature of the BYTEmark is a set of statistical-analysis +routines. Running benchmarks is one thing; the question arises as to how +many times should a test be run until you know you have a good sampling. +Also, can you determine whether the test is stable (i.e., do results vary +widely from one execution of the benchmark to the next)? + +The BYTEmark keeps score as follows: Each test (a test being a numeric +sort, a string sort, etc.) is run five times. These five scores are +averaged, the standard deviation is determined, and a 95% confidence +half-interval for the mean is calculated (using the student t +distribution). This tells us that the true average lies -- with a 95% +probability -- within plus or minus the confidence half-interval of +the calculated average. If this half-interval is within 5% of the +calculated average, the benchmarking stops. Otherwise, a new test is +run and the calculations are repeated with all of the runs done so +far, including the new one. The benchmark proceeds this way up to a +total of 30 runs. If the length of the half-interval is still bigger +than 5% of the calculated average then a warning issued that the +results might not be statistically certain before the average is +displayed. + +** Fixed a statistical bug here. Uwe F. Mayer + +The upshot is that, for each benchmark test, the true average is -- with a +95% level of confidence -- within 5% of the average reported. Here, the +"true average" is the average we would get were we able to run the tests +over and over again an infinite number of times. + +This specification ensures that the calculation of results is controlled; +that someone running the tests in California will use the same technique for +determining benchmark results as someone running the tests in New York. + +In case there is uneven system load due to other processes while this +benchmark suite executes, it might take longer to run the benchmark suite +as compared to a run an unloaded system. This is because the benchmark does +some statistical analysis to make sure that the reported results are +statistically significant (as explained above), and a high variation in +individual runs requires more runs to achieve the required statistical +confidence. + +*** added last the paragraph, Uwe F. Mayer + +Interpreting Results + +Of course, running the benchmarks can present you with a boatload of data. +It can get mystifying, and some of the more esoteric statistical information +is valuable only to a limited audience. The big question is: What does it +all mean? + +First, we should point out that the BYTEmark reports both "raw" and indexed +scores for each test. The raw score for a particular test amounts to the +"iterations per second" of that test. For example, the numeric sort test +reports as its raw score the number of arrays it was able to sort per +second. + +The indexed score is the raw score of the system under test divided by the +raw score obtained on the baseline machine. As of this release, the +baseline machine is a DELL 90 Mhz Pentium XPS/90 with 16 MB of RAM and 256K +of external processor cache. (The compiler used was the Watcom C/C++ 10.0 +compiler; optimizations set to "fastest possible code", 4-byte structure +alignment, Pentium code generation with Pentium register-based calling. The +operating system was MSDOS.) The indexed score serves to "normalize" the +raw scores, reducing their dynamic range and making them easier to +grasp. Simply put, if your machine has an index score of 2.0 on the numeric +sort test, it performed that test twice as fast as this 90 Mhz Pentium. + +If you run all the tests (as you'll see, it is possible to perform "custom +runs", which execute only a subset of the tests) the BYTEmark will also +produce two overall index figures: Integer index and Floating-point index. +The Integer index is the geometric mean of those tests that involve only +integer processing -- numeric sort, string sort, bitfield, emulated +floating-point, assignment, Huffman, and IDEA -- while the Floating-point +index is the geometric mean of those tests that require the floating-point +coprocessor -- Fourier, neural net, and LU decomposition. You can use these +scores to get a general feel for the performance of the machine under test +as compared to the baseline 90 Mhz Pentium. + +The Linux/Unix port has a second baseline machine, it is an AMD K6/233 with +32 MB RAM and 512 KB L2-cache running Linux 2.0.32 and using GNU gcc +version 2.7.2.3 and libc-5.4.38. The integer index was split as suggested +by Andrew D. Balsa <andrewbalsa@usa.net>, and reflects the realization that +memory management is important in CPU design. The original tests have been +left alone, however, the geometric mean of the tests NUMERIC SORT, FP +EMULATION, IDEA, and HUFFMAN now constitutes the integer-arithmetic focused +benchmark index, while the geometric mean of the tests STRING SORT, +BITFIELD, and ASSIGNMENT makes up the new memory index. The floating point +index has been left alone, it is still the geometric mean of FOURIER, +NEURAL NET, and LU DECOMPOSITION. + +*** added the section on Linux, Uwe F. Mayer + +What follows is a list of the benchmarks and associated brief remarks that +describe what the tests do: What they exercise; what a "good" result or a +"bad" result means. Keep in mind that, in this expanding universe of faster +processors, bigger caches, more elaborate memory architectures, "good" and +"bad" are indeed relative terms. A good score on today's hot new processor +will be a bad score on tomorrow's hot new processor. + +These remarks are based on empirical data and profiling that we have done to +date. (NOTE: The profiling is limited to Intel and Motorola 68K on this +release. As more data is gathered, we will be refining this section. +3/14/95--RG) + +Benchmark Description + +Numeric sort Generic integer performance. Should + exercise non-sequential performance + of cache (or memory if cache is less + than 8K). Moves 32-bit longs at a + time, so 16-bit processors will be + at a disadvantage. + + + +String sort Tests memory-move performance. + Should exercise non-sequential + performance of cache, with added + burden that moves are byte-wide and + can occur on odd address boundaries. + May tax the performance of + cell-based processors that must + perform additional shift operations + to deal with bytes. + + + +Bitfield Exercises "bit twiddling" + performance. Travels through memory + in a somewhat sequential fashion; + different from sorts in that data is + merely altered in place. If + properly compiled, takes into + account 64-bit processors, which + should see a boost. + + + +Emulated F.P. Past experience has shown this test + to be a good measurement of overall + performance. + + + +Fourier Good measure of transcendental and + trigonometric performance of FPU. + Little array activity, so this test + should not be dependent of cache or + memory architecture. + + + +Assignment The test moves through large integer + arrays in both row-wise and + column-wise fashion. Cache/memory + with good sequential performance + should see a boost (memory is + altered in place -- no moving as in + a sort operation). Processing is + done in 32-bit chunks -- no + advantage given to 64-bit + processors. + + + +Huffman A combination of byte operations, + bit twiddling, and overall integer + manipulation. Should be a good + general measurement. + + + +IDEA Moves through data sequentially in + 16-bit chunks. Should provide a + good indication of raw speed. + + + +Neural Net Small-array floating-point test + heavily dependent on the exponential + function; less dependent on overall + FPU performance. Small arrays, so + cache/memory architecture should not + come into play. + + + +LU decomposition. A floating-point test that moves + through arrays in both row-wise and + column-wise fashion. Exercises only + fundamental math operations (+, -, + *, /). + +The Command File + +Purpose + +The BYTEmark program allows you to override many of its default parameters +using a command file. The command file also lets you request statistical +information, as well as specify an output file to hold the test results for +later use. + +You identify the command file using a command-line argument. E.G., + +C:NBENCH -cCOMFILE.DAT + +tells the benchmark program to read from COMFILE.DAT in the current +directory. + +The content of the command file is simply a series of parameter names and +values, each on a single line. The parameters control internal variables +that are either global in nature (i.e., they effect all tests in the +program) or are specific to a given benchmark test. + +The parameters are listed in a reference guide that follows, arranged in the +following groups: + +Global Parameters + +Numeric Sort + +String Sort + +Bitfield + +Emulated floating-point + +Fourier coefficients + +Assignment algorithm + +IDEA encryption + +Huffman compression + +Neural net + +LU decomposition + +As mentioned above, those items listed under "Global Parameters" affect all +tests; the rest deal with specific benchmarks. There is no required ordering +to parameters as they appear in the command file. You can specify them in +any sequence you wish. + +You should be judicious in your use of a command file. Some parameters will +override the "dynamic workload" adjustment that each test performs. Doing +this completely bypasses the benchmark code that is designed to produce an +accurate reading from your system clock. Other parameters will alter default +settings, yielding test results that cannot be compared with published +benchmark results. + +A Sample Command File + +Suppose you built a command file that contained the following: + +ALLSTATS=T + +CUSTOMRUN=T + +OUTFILE=D:\DATA.DAT + +DONUMSORT=T + +DOLU=T + +Here's what this file tells the benchmark program: + +ALLSTATS=T means that you've requested a "dump" of all the statistics the +test gathers. This includes not only the standard deviations of tests run, +it also produces test-specific information such as the number of arrays +built, the array size, etc. + +CUSTOMRUN=T tells the system that this is a custom run. Only tests +explicitly specified will be executed. + +OUTFILE=D:\DATA.DAT will write the output of the benchmark to the file +DATA.DAT on the root of the D: drive. (If DATA.DAT already exists, output +will be appended to the file.) + +DONUMSORT=T tells the system to run the numeric sort benchmark. (This was +necessary on account of the CUSTOMRUN=T line, above.) + +DOLU=T tells the system to run the LU decomposition benchmark. + +Command File Parameters Reference + +(NOTE: Altering some global parameters can invalidate results for comparison +purposes. Those parameters are indicated in the following section by a bold +asterisk (*). If you alter any parameters so indicated, you may NOT publish +the resulting data as BYTEmark scores.) + +Global Parameters + +GLOBALMINTICKS=<n> + +This overrides the default global_min_ticks value (defined in NBENCH1.H). +The global_min_ticks value is defined as the minimum number of clock ticks +per iteration of a particular benchmark. For example, if global_min_ticks is +set to 100 and the numeric sort benchmark is run; each iteration MUST take +at least 100 ticks, or the system will expand the work-per-iteration. + +MINSECONDS=<n> + +Sets the minimum number of seconds any particular test will run. This has +the effect of controlling the number of repetitions done. Default: 5. + +ALLSTATS=<T|F> + +Set this flag to T for a "dump" of all statistics. The information displayed +varies from test to test. Default: F. + +OUTFILE=<path> + +Specifies that output should go to the specified output file. Any test +results and statistical data displayed on-screen will also be written to the +file. If the file does not exist, it will be created; otherwise, new output +will be appended to an existing file. This allows you to "capture" several +runs into a single file for later review. + +Note: the path should not appear in quotes. For example, something like the +following would work: OUTFILE=C:\BENCH\DUMP.DAT + +CUSTOMRUN=<T|F> + +Set this flag to T for a custom run. A "custom run" means that the program +will run only the benchmark tests that you explicitly specify. So, use this +flag to run a subset of the tests. Default: F. + +Numeric Sort + +DONUMSORT=<T|F> + +Indicates whether to do the numeric sort. Default is T, unless this is a +custom run (CUSTOMRUN=T), in which case default is F. + +NUMNUMARRAYS=<n> + +Indicates the number of numeric arrays the system will build. Setting this +value will override the program's "dynamic workload" adjustment for this +test.* + +NUMARRAYSIZE=<n> + +Indicates the number of elements in each numeric array. Default is 8001 +entries. (NOTE: Altering this value will invalidate the test for comparison +purposes. The performance of the numeric sort test is not related to the +array size as a linear function; i.e., an array twice as big will not take +twice as long. The relationship involves a logarithmic function.)* + +NUMMINSECONDS=<n> + +Overrides MINSECONDS for the numeric sort test. + +String Sort + +DOSTRINGSORT=<T|F> + +Indicates whether to do the string sort. Default is T, unless this is a +custom run (CUSTOMRUN=T), in which case the default is F. + +STRARRAYSIZE=<n> + +Sets the size of the string array. Default is 8111. (NOTE: Altering this +value will invalidate the test for comparison purposes. The performance of +the string sort test is not related to the array size as a linear function; +i.e., an array twice as big will not take twice as long. The relationship +involves a logarithmic function.)* + +NUMSTRARRAYS=<n> + +Sets the number of string arrays that will be created to run the test. +Setting this value will override the program's "dynamic workload" adjustment +for this test.* + +STRMINSECONDS=<n> + +Overrides MINSECONDS for the string sort test. + +Bitfield + +DOBITFIELD=<T|F> + +Indicates whether to do the bitfield test. Default is T, unless this is a +custom run (CUSTOMRUN=T), in which case the default is F. + +NUMBITOPS=<n> + +Sets the number of bitfield operations that will be performed. Setting this +value will override the program's "dynamic workload" adjustment for this +test.* + +BITFIELDSIZE=<n> + +Sets the number of 32-bit elements in the bitfield arrays. The default value +is dependent on the size of a long as defined by the current compiler. For a +typical compiler that defines a long to be 32 bits, the default is 32768. +(NOTE: Altering this parameter will invalidate test results for comparison +purposes.)* + +BITMINSECONDS=<n> + +Overrides MINSECONDS for the bitfield test. + +Emulated floating-point + +DOEMF=<T|F> + +Indicates whether to do the emulated floating-point test. Default is T, +unless this is a custom run (CUSTOMRUN=T), in which case the default is F. + +EMFARRAYSIZE=<n> + +Sets the size (number of elements) of the emulated floating-point benchmark. +Default is 3000. The test builds three arrays, each of equal size. This +parameter sets the number of elements for EACH array. (NOTE: Altering this +parameter will invalidate test results for comparison purposes.)* + +EMFLOOPS=<n> + +Sets the number of loops per iteration of the floating-point test. Setting +this value will override the program's "dynamic workload" adjustment for +this test.* + +EMFMINSECONDS=<n> + +Overrides MINSECONDS for the emulated floating-point test. + +Fourier coefficients + +DOFOUR=<T|F> + +Indicates whether to do the Fourier test. Default is T, unless this is a +custom run (CUSTOMRUN=T), in which case the default is F. + +FOURASIZE=<n> + +Sets the size of the array for the Fourier test. This sets the number of +coefficients the test will derive. NOTE: Specifying this value will override +the system's "dynamic workload" adjustment for this test, and may make the +results invalid for comparison purposes.* + +FOURMINSECONDS=<n> + +Overrides MINSECONDS for the Fourier test. + +Assignment Algorithm + +DOASSIGN=<T|F> + +Indicates whether to do the assignment algorithm test. Default is T, unless +this is a custom run (CUSTOMRUN=T), in which case the default is F. + +ASSIGNARRAYS=<n> + +Indicates the number of arrays that will be built for the test. Specifying +this value will override the system's "dynamic workload" adjustment for this +test. (NOTE: The size of the arrays in the assignment algorithm is fixed at +101 x 101. Altering the array size requires adjusting global constants and +recompiling; to do so, however, would invalidate test results.)* + +ASSIGNMINSECONDS=<n> + +Overrides MINSECONDS for the assignment algorithm test. + +IDEA encryption + +DOIDEA=<T|F> + +Indicates whether to do the IDEA encryption test. Default is T, unless this +is a custom run (CUSTOMRUN=T), in which case the default is F. + +IDEAARRAYSIZE=<n> + +Sets the size of the plain-text character array that will be encrypted by the +test. Default is 4000. The benchmark actually builds 3 arrays: 1st +plain-text, encrypted version, and 2nd plain-text. The 2nd plain-text array is +the destination for the decryption process [part of the test]. All arrays +are set to the same size. (NOTE: Specifying this value will invalidate test +results for comparison purposes.)* + +IDEALOOPS=<n> + +Indicates the number of loops in the IDEA test. Specifying this value will +override the system's "dynamic workload" adjustment for this test.* + +IDEAMINSECONDS=<n> + +Overrides MINSECONDS for the IDEA test. + +Huffman compression + +DOHUFF=<T|F> + +Indicates whether to do the Huffman test. Default is T, unless this is a +custom run (CUSTOMRUN=T), in which case the default is F. + +HUFFARRAYSIZE=<n> + +Sets the size of the string buffer that will be compressed using the Huffman +test. The default is 5000. (NOTE: Altering this value will invalidate test +results for comparison purposes.)* + +HUFFLOOPS=<n> + +Sets the number of loops in the Huffman test. Specifying this value will +override the system's "dynamic workload" adjustment for this test.* + +HUFFMINSECONDS=<n> + +Overrides MINSECONDS for the Huffman test. + +Neural net + +DONNET=<T|F> + +Indicates whether to do the Neural Net test. Default is T, unless this is a +custom run (CUSTOMRUN=T), in which case the default is F. + +NNETLOOPS=<n> + +Sets the number of loops in the Neural Net test. NOTE: Altering this value +overrides the benchmark's "dynamic workload" adjustment algorithm, and may +invalidate the results for comparison purposes.* + +NNETMINSECONDS=<n> + +Overrides MINSECONDS for the Neural Net test. + +LU decomposition + +DOLU=<T|F> + +Indicates whether to do the LU decomposition test. Default is T, unless this +is a custom run (CUSTOMRUN=T), in which case the default is F. + +LUNUMARRAYS=<n> + +Sets the number of arrays in each iteration of the LU decomposition test. +Specifying this value will override the system's "dynamic workload" +adjustment for this test.* + +LUMINSECONDS=<n> + +Overrides MINSECONDS for the LU decomposition test. + +Numeric Sort + +Description + +This benchmark is designed to explore how well the system sorts a numeric +array. In this case, a numeric array is a one-dimensional collection of +signed, 32-bit integers. The actual sorting is performed by a heapsort +algorithm (see the text box following for a description of the heapsort +algorithm). + +It's probably unnecessary to point out (but we'll do it anyway) that sorting +is a fundamental operation in computer application software. You'll likely +find sorting routines nestled deep inside a variety of applications; +everything from database systems to operating-systems kernels. + +The numeric sort benchmark reports the number of arrays it was able to sort +per second. The array size is set by a global constant (it can be overridden +by the command file -- see below). + +Analysis + +Optimized 486 code: Profiling of the numeric sort benchmark using Watcom's +profiler (Watcom C/C++ 10.0) indicates that the algorithm spends most of its +time in the numsift() function (specifically, about 90% of the benchmark's +time takes place in numsift()). Within numsift(), two if statements dominate +time spent: + +if(array[k]<array[k+1L]) and if(array[i]<array[k]) + +Both statements involve indexes into arrays, so it's likely the processor is +spending a lot of time resolving the array references. (Though both +statements involve "less-than" comparisons, we doubt that much time is +consumed in performing the signed compare operation.) Though the first +statement involves array elements that are adjacent to one another, the +second does not. In fact, the second statement will probably involve +elements that are far apart from one another during early passes through the +sifting process. We expect that systems whose caching system pre-fetches +contiguous elements (often in "burst" line fills) will not have any great +advantage of systems without pre-fetch mechanisms. + +Similar results were found when we profiled the numeric sort algorithm under +the Borland C/C++ compiler. + +680x0 Code (Macintosh CodeWarrior): CodeWarrior's profiler is function +based; consequently, it does not allow for line-by-line analysis as does the +Watcom compiler's profiler. + +However, the CodeWarrior profiler does give us enough information to note +that NumSift() only accounts for about 28% of the time consumed by the +benchmark. The outer routine, NumHeapSort() accounts for around 71% of the +time taken. It will require additional analysis to determine why the two +compilers -- Watcom and CodeWarrior divide the workload so differently. (It +may have something to do with compiler architecture, or the act of profiling +the code may produce results that are significantly different than how the +program runs under normal conditions, though that would lead one to wonder +what use profilers would be.) + +Porting Considerations + +The numeric sort routine should represent a trivial porting exercise. It is +not an overly large benchmark in terms of source code. Additionally, the +only external routines it calls on are for allocating and releasing memory, +and managing the stopwatch. + +The numeric sort benchmark depends on the following global definitions (note +that these may be overridden by the command file): + +NUMNUMARRAYS -- Sets the upper limit on the number of arrays that the +benchmark will attempt to build. The numeric sort benchmark creates work for +itself by requiring the system to sort more and more arrays...not bigger and +bigger arrays. (The latter case would skew results, because the sorting time +for heapsort is N log2 N - e.g., doubling the array size does not double the +sort time.) This constant sets the upper limit to the number of arrays the +system will build before it signals an error. The default value is 100, and +may be changed if your system exceeds this limit. + +NUMARRAYSIZE - Determines the size of each array built. It has been set to +8111L and should not be tampered with. The command file entry +NUMARRAYSIZE=<n> can be used to change this value, but results produced by +doing this will make your results incompatible with other runs of the +benchmark (since results will be skewed -- see preceding paragraph). + +To test for a correct execution of the numeric sort benchmark, #define the +DEBUG symbol. This will enable code that verifies that arrays are properly +sorted. You should run the benchmark program using a command file that has +only the numeric sort test enabled. If there is an error, the program will +display "SORT ERROR" (If this happens, it's possible that tons of "SORT +ERROR" messages will be emitted, so it's best not to redirect output to a +file), otherwise it will print "Numeric sort: OK" (also quite a few times). + +References + +Gonnet, G.H. 1984, Handbook of Algorithms and Data Structures (Reading, MA: +Addison-Wesley). + +Knuth, Donald E. 1968, Fundamental Algorithms, vol 1 of The Art of Computer +Programming (Reading, MA: Addison-Wesley). + +Press, William H., Flannery, Brian P., Teukolsky, Saul A., and Vetterling, +William T. 1989, Numerical Recipes in Pascal (Cambridge: Cambridge +University Press). + +Heapsort + +The heapsort algorithm is well-covered in a number of the popular +computer-science textbooks. In fact, it gets a pat on the back in Numerical +Recipes (Press et. al.), where the authors write: + +Heapsort is our favorite sorting routine. It can be recommended +wholeheartedly for a variety of sorting applications. It is a true +"in-place" sort, requiring no auxiliary storage. + +Heapsort works by building the array into a kind of a queue called a heap. +You can imagine this heap as being a form of in-memory binary tree. The +topmost (root) element of the tree is the element that -- were the array +sorted -- would be the largest element in the array. Sorting takes place by +first constructing the heap, then pulling the root off the tree, promoting +the next largest element to the root, pulling it off, and so on. (The +promotion process is known as "sifting up.") + +Heapsort executes in N log2 N time even in its worst case. Unlike some other +sorting algorithms, it does not benefit from a partially sorted array +(though Gonnet does refer to a variation of heapsort, called "smoothsort," +which does -- see references). + +String Sort + +Description + +This benchmark is designed to gauge how well the system moves bytes around. +By that we mean, how well the system can copy a string of bytes from one +location to another; source and destination being aligned to arbitrary +addresses. (This is unlike the numeric sort array, which moves bytes +longword-at-a-time.) The strings themselves are built so as to be of random +length, ranging from no fewer than 4 bytes and no greater than 80 bytes. The +mixture of random lengths means that processors will be forced to deal with +strings that begin and end on arbitrary address boundaries. + +The string sort benchmark uses the heapsort algorithm; this is the same +algorithm as is used in the numeric sort benchmark (see the sidebar on the +heapsort for a detailed description of the algorithm). + +Manipulation of the strings is actually handled by two arrays. One array +holds the strings themselves; the other is a pointers array. Each member of +the pointers array carries an offset that points into the string array, so +that the ith pointer carries the offset to the ith string. This allows the +benchmark to rapidly locate the position of the ith string. (The sorting +algorithm requires exchanges of items that might be "distant" from one +another in the array. It's critical that the routine be able to rapidly find +a string based on its indexed position in the array.) + +The string sort benchmark reports the number of string arrays it was able to +sort per second. The size of the array is set by a global constant. + +Analysis + +Optimized 486 code (Watcom C/C++ 10.0): Profiling of the string sort +benchmark indicates that it spends most of its time in the C library routine +memmove(). Within that routine, most of the execution is consumed by a pair +of instructions: rep movsw and rep movsd. These are repeated string move -- +word width and repeated string move -- doubleword width, respectively. + +This is precisely where we want to see the time spent. It's interesting to +note that the memmove() of the particular compiler/profiler tested (Watcom +C/C++ 10.0) was "smart" enough to do most of the moving on word or +doubleword boundaries. The string sort benchmark specifically sets arbitrary +boundaries, so we'd expect to see lots of byte-wide moves. The "smart" +memmove() is able to move bytes only when it has to, and does the remainder +of the work via words and doublewords (which can move more bits at a time). + +680x0 Code (Macintosh CodeWarrior): Because CodeWarrior's profiler is +function based, it is impossible to get an idea of how much time the test +spends in library routines such as memmove(). Fortunately, as an artifact of +the early version of the benchmark, the string sort algorithm makes use of +the MoveMemory() routine in the sysspec.c file (system specific routines). +This call, on anything other than a 16-bit DOS system, calls memmove() +directly. Hence, we can get a good approximation of how much time is spent +moving bytes. + +The answer is that nearly 78% of the benchmark's time is consumed by +MoveMemory(), the rest being taken up by the other routines (the +str_is_less() routine, which performs string comparisons, takes about 7% of +the time). As above, we can guess that most of the benchmark's time is +dependent on the performance of the library's memmove() routine. + +Porting Considerations + +As with the numeric sort routine, the string sort benchmark should be simple +to port. Simpler, in fact. The string sort benchmark routine is not +dependent on any typedef that may change from machine to machine (unless a +char type is not 8 bits). + +The string sort benchmark depends on the following global definitions: + +NUMSTRARRAYS - Sets the upper limit on the number of arrays that the +benchmark will attempt to build. The string sort benchmark creates work for +itself by requiring the system to sort more and more arrays, not bigger and +bigger arrays. (See section on Numeric Sort for an explanation.) This +constant sets the upper limit to the number of arrays the system will build +before it signals an error. The default value is 100, and may be changed if +your system exceeds this limit. + +STRARRAYSIZE - Sets the default size of the string arrays built. We say +"arrays" because, as with the numeric sort benchmark, the system adds work +not by expanding the size of the array, but by adding more arrays. This +value is set to 8111, and should not be modified, since results would not be +comparable with other runs of the same benchmark on other machines. + +To test for a correct execution of the string sort benchmark, #define +the DEBUG symbol. This will enable code that verifies the arrays are +properly sorted. Set up a command file that runs only the string sort, +and execute the benchmark program. If the routine is operating +properly, the benchmark will print "String sort: OK", this message is +printed quite often. Otherwise, the program will display "SORT ERROR" +for each pair of strings it finds out of order (which can be really +often). + +References + +See the references for the Numeric Sort benchmark. + +Bitfield Operations + +Description + +The purpose of this benchmark is to explore how efficiently the system +executes operations that deal with "twiddling bits." The test is set up to +simulate a "bit map"; a data structure used to keep track of storage usage. +(Don't confuse this meaning of "bitmap" with its use in describing a +graphics data structure.) + +Systems often use bit maps to keep an inventory of memory blocks or (more +frequently) disk blocks. In the case of a bit map that manages disk usage, +an operating system will set aside a buffer in memory so that each bit in +that buffer corresponds to a block on the disk drive. A 0 bit means that the +corresponding block is free; a 1 bit means the block is in use. Whenever a +file requests a new block of disk storage, the operating system searches the +bit map for the first 0 bit, sets the bit (to indicate that the block is now +spoken for), and returns the number of the corresponding disk block to the +requesting file. + +These types of operations are precisely what this test simulates. A block of +memory is set allocated for the bit map. Another block of memory is +allocated, and set up to hold a series of "bit map commands". Each bitmap +command tells the simulation to do 1 of 3 things: + +1) Clear a series of consecutive bits, + +2) Set a series of consecutive bits, or + +3) Complement (1->0 and 0->1) a series of consecutive bits. + +The bit map command block is loaded with a set of random bit map commands +(each command covers an random number of bits), and simulation routine steps +sequentially through the command block, grabbing a command and executing it. + +The bitfield benchmark reports the number of bits it was able to operate on +per second. The size of the bit map is constant; the bitfield operations +array is adjusted based on the capabilities of the processor. (See the +section describing the auto-adjust feature of the benchmarks.) + +Analysis + +Optimized 486 code: Using the Watcom C/C++ 10.0 profiler, the Bitfield +benchmark appears to spend all of its time in two routines: ToggleBitRun() +(74% of the time) and DoBitFieldIteration() (24% of the time). We say +"appears" because this is misleading, as we will explain. + +First, it is important to recall that the test performs one of three +operations for each run of bits (see above). The routine ToggleBitRun() +handles two of those three operations: setting a run of bits and clearing a +run of bits. An if() statement inside ToggleBitRun() decides which of the +two operations is performed. (Speed freaks will quite rightly point out that +this slows the entire algorithm. ToggleBitRun() is called by a switch() +statement which has already decided whether bits should be set or cleared; +it's a waste of time to have ToggleBitRun() have to make that decision yet +again.) + +DoBitFieldIteration() is the "outer" routine that calls ToggleBitRun(). +DoBitFieldIteration() also calls FlipBitRun(). This latter routine is the +one that performs the third bitfield operation: complementing a run of bits. +FlipBitRun() gets no "air time" at all (while DoBitFieldIteration() gets 24 +% of the time) simply because the compiler's optimizer recognizes that +FlipBitRun() is only called by DoBitFieldIteration(), and is called only +once. Consequently, the optimizer moves FlipBitRun() "inline", i.e., into +DoBitFieldIteration(). This removes an unnecessary call/return cycle (and is +probably part of the reason why the FlipBitRun() code gets 24% of the +algorithm's time, instead of something closer to 30% of its time.) + +Within the routines, those lines of code that actually do the shifting, the +and operations, and the or operations, consume time evenly. This should make +for a good test of a processor's "bit twiddling" capabilities. + +680x0 Code (Macintosh CodeWarrior): The CodeWarrior profiler is function +based. Consequently, it is impossible to produce a profile of machine +instruction execution time. We can, however, get a good picture of how the +algorithm divides its time among the various functions. + +Unlike the 486 compiler, the CodeWarrior compiler did not appear to collapse +the FlipBitRun() routine into the outer DoBitFieldIteration() routine. (We +don't know this for certain, of course. It's possible that the compiler +would have done this had we not been profiling.) + +In any case, the time spent in the two "core" routines of the bitfield test +are shown below: + +FlipBitRun() - 18031.2 microsecs (called 509 times) + +ToggleBitRun() - 50770.6 microsecs (called 1031 times) + +In terms of total time, FlipBitRun() takes about 35% of the time (it gets +about 33% of the calls). Remember, ToggleBitRun() is a single routine that +is called both to set and clear bits. Hence, ToggleBitRun() is called twice +as often as FlipBitRun(). + +We can conclude that time spent setting bits to 1, setting bits to 0, and +changing the state of bits, is about equal; the load is balanced close to +what we'd expect it to be, based on the structure of the algorithm. + +Porting Considerations + +The bitfield operations benchmark is dependent on the size of the long +datatype. On most systems, this is 32 bits. However, on some of the newer +RISC chips, a long can be 64 bits long. If your system does use 64-bit +longs, you'll need to #define the symbol LONG64. + +If you are unsure of the size of a long in your system (some C compiler +manuals make it difficult to discover), simply place an ALLSTATS=T line in +the command file and run the benchmarks. This will cause the benchmark +program to display (among other things) the size of the data types int, +short, and long in bytes. + +BITFARRAYSIZE - Sets the number of longs in the bit map array. This number +is fixed, and should not be altered. The bitfield test adjusts itself by +adding more bitfield commands (see above), not by creating a larger bit map. + +Currently, there is no code added to test for correct execution. If you are +concerned that your port was incorrect, you'll need to step through your +favorite debugger and verify execution against the original source code. + +** I added a resetting of the random number generator, and a resetting +** of the bitfield to each loop. Those operations are outside of the +** timed loop, and should add to make the benchmark more consistent. +** There also is now debugging information available. If you define +** DEBUG then the program will write a file named "debugbit.dat", +** which is the contents of the bitfield after the calibration loop of +** 30 operations. You can compare this file with the file +** "debugbit.good" that comes with the distribution. +** Uwe F. Mayer <mayer@tux.edu> + +References + +None. + +Emulated Floating-point + +Description + +The emulated floating-point benchmark includes routines that are similar to +those that would be executed whenever a system performs floating-point +operations in the absence of a coprocessor. In general, this amounts to a +mixture of integer instructions, including shift operations, integer +addition and subtraction, and bit testing (among others). + +The benchmark itself is remarkably simple. The test builds three +1-dimensional arrays and loads the first two up with random floating-point +numbers. The arrays are then partitioned into 4 equal-sized groups, and the +test proceeds by performing addition, subtraction, multiplication, and +division -- one operation on each group. (For example, for the addition +group, an element from the first array is added to the second array and the +result is placed in the third array.) + +Of course, most of the work takes place inside the routines that perform the +addition, subtraction, multiplication, and division. These routines operate +on a special data type (referred to as an InternalFPF number) that -- though +not strictly IEEE compliant -- carries all the necessary data fields to +support an IEEE-compatible floating-point system. Specifically, an +InternalFPF number is built up of the following fields: + +Type (indicates a NORMAL, SUBNORMAL, etc.) + +Mantissa sign + +Unbiased, signed 16-bit exponent + +4-word (16 bits) mantissa. + +The emulated floating-point test reports its results in number of loops per +second (where a "loop" is one pass through the arrays as described above). + +Finally, we are aware that this test could be on its way to becoming an +anachronism. A growing number of systems are appearing that have +coprocessors built into the main CPU. It's possible that floating-point +emulation will one day be a thing of the past. + +Analysis + +Optimized 486 code (Watcom C/C++ 10.0): The algorithm's time is distributed +across a number of routines. The distribution is: + +ShiftMantLeft1() - 60% of the time + +ShiftMantRight1() - 17% of the time + +DivideInternalFPF() - 14% of the time + +MultiplyInternalFPF() - 5% of the time. + +The first two routines are similar to one another; both shift bits about in +a floating-point number's mantissa. It's reasonable that ShiftMantLeft1() +should take a larger share of the system's time; it is called as part of the +normalization process that concludes every emulated addition, subtraction, +mutiplication, and division. + +680x0 Code (Macintosh CodeWarrior): CodeWarrior's profiler is +function-based; consequently, it isn't possible to get timing at the machine +instruction level. However, the output to CodeWarrior's profiler has +provided insight into the breakdown of time spent in various functions that +forces us to rethink our 486 code analysis. + +Analyzing what goes on inside the emulated floating-point tests is a tough +one to call because some of the routines that are part of the test are +called by the function that builds the arrays. Consequently, a quick look at +the profiler's output can be misleading; it's not obvious how much time a +particular routine is spending in the test and how much time that same +routine is spending setting up the test (an operation that does not get +timed). + +Specifically, the routine that loads up the arrays with test data calls +LongToInternalFPF() and DivideInternalFPF(). LongToInternalFPF() makes one +call to normalize() if the number is not a true zero. In turn, normalize() +makes an indeterminate number of calls to ShiftMantLeft1(), depending on the +structure of the mantissa being normalized. + +What's worse, DivideInternalFPF() makes all sorts of calls to all kinds of +important low-level routines such as Sub16Bits() and ShiftMantLeft1(). +Untangling the wiring of which routine is being called as part of the test, +and which is being called as part of the setup could probably be done with +the computer equivalent of detective work and spelunking, but in the +interest of time we'll opt for approximation. + +Here's a breakdown of some of the important routines and their times: + +AddSubInternalFPF() - 1003.9 microsecs (called 9024 times) + +MultiplyInternalFPF() - 20143 microsecs (called 5610 times) + +DivideInternalFPF() - 18820.9 microsecs (called 3366 times). + +The 3366 calls to DivideInternalFPF() are timed calls, not setup calls -- +the profiler at least gives outputs of separate calls made to the same +routine, so we can determine which call is being made by the benchmark, and +which is being made by the setup routine. It turns out that the setup +routine calls DivideInternalFPF() 30,000 times. + +Notice that though addition/subtraction are called most often, +multiplication next, then finally division; the time spent in each is the +reverse. Division takes the most time, then multiplication, finally +addition/subtraction. (There's probably some universal truth lurking here +somewhere, but we haven't found it yet.) + +Other routines, and their breakdown: + +Add16Bits() - 115.3 microsecs + +ShiftMantRight1() - 574.2 microsecs + +Sub16Bits() - 1762 microsecs + +StickySiftRightMant - 40.4 microsecs + +ShiftMantLeft1() - 17486.1 microsecs + +The times for the last three routines are suspect, since they are called by +DivideInternalFPF(), and a large portion of their time could be part of the +setup process. This is what leads us to question the results obtained in the +486 analysis, since it, too, is unable to determine precisely who is calling +whom. + +Porting Considerations + +Earlier versions of this benchmark were extremely sensitive to porting; +particularly to the "endianism" of the target system. We have tried to +eliminate many of these problems. The test is nonetheless more "sensitive" +to porting than most others. + +Pay close attention to the following defines and typedefs. They can be found +in the files EMFLOAT.H, NMGLOBAL.H, and NBENCH1.H: + +u8 - Stands for unsigned, 8-bit. Usually defined to be unsigned char. + +u16 - Stands for unsigned, 16-bit. Usually defined to be unsigned short. + +u32 - Stands for unsigned, 32-bit. Usually defined to be unsigned long. + +INTERNAL_FPF_PRECISION - Indicates the number of elements in the mantissa of +an InternalFPF number. Should be set to 4. + +The exponent field of an InternalFPF number is of type short. It should be +set to whatever minimal data type can hold a signed, 16-bit number. + +Other global definitions you will want to be aware of: + +CPUEMFLOATLOOPMAX - Sets the maximum number of loops the benchmark will +attempt before flagging an error. Each execution of a loop in the emulated +floating-point test is "non-destructive," since the test takes factors from +two arrays, operates on the factors, and places the result in a third array. +Consequently, the test makes more work for itself by increasing the number +of times it passes through the arrays (# of loops). If the system exceeds +the limit set by CPUEMFLOATLOOPMAX, it will signal an error. + +This value may be altered to suit your system; it will not effect the +benchmark results (unless you reduce it so much the system can never +generate enough loops to produce a good test run). + +EMFARRAYSIZE - Sets the size of the arrays to be used in the test. This +value is the number of entries (InternalFPF numbers) per array. Currently, +the number is fixed at 3000, and should not be altered. + +Currently, there is no means of testing correct execution of the benchmark +other than via debugger. There are routines available to decode the internal +floating point format and print out the numbers, but no formal correctness +test has been constructed. (This should be available soon. -- 3/14/95 RG) + +** It now prints out the operations of 8 of the entries used in the +** test. Assuming you leave EMFARRAYSIZE at 3000, your results should +** look like the ones below. The number in front of the colon is the +** index of the entry. +** +** 2: (-1.1160E 0) + (-4.5159E 0) = -5.6320E 0 +** 6: (-4.4507E -1) - (-8.2050E -1) = +3.7543E -1 +** 10: (+1.2465E 0) * (+7.4667E -1) = +9.3075E -1 +** 14: (-1.2781E 0) / (-1.7367E 0) = +7.3596E -1 +** 2986: (-7.0390E 0) * (-2.0752E 0) = +1.4607E 1 +** 2990: (+8.3753E -1) / (+2.3876E 1) = +3.5078E -2 +** 2994: (-1.1393E 0) + (-1.6080E 1) = -1.7219E 1 +** 2998: (+7.2450E 0) - (-8.2654E -1) = +8.0716E 0 +** +** Uwe F. Mayer <mayer@tux.edu> + +References + +Microprocessor Programming for Computer Hobbyists, Neill Graham, Tab Books, +Blue Ridge Summit, PA, 1977. + +Apple Numerica Manual, Second edition, Apple Computer, Addison-Wesley +Publishing Co., Reading, MA, 1988. + +Fourier Series + +Description + +This is a floating-point benchmark designed primarily to exercise the +trigonometric and transcendental functions of the system. It calculates the +first n Fourier coefficients of the function (x+1)x on the interval 0,2. In +this case, the function (x+1)x is being treated as a cyclic waveform with a +period of 2. + +The Fourier coefficients, when applied as factors to a properly constructed +series of sine and cosine functions, allow you to approximate the original +waveform. (In fact, if you can calculate all the Fourier coefficients -- +there'll be an infinite number -- you can reconstruct the waveform exactly). +You have to calculate the coefficients via integration, and the algorithm +does this using a simple trapezoidal rule for its numeric integration +function. + +The upshot of all this is that it provides an exercise for the +floating-point routines that calculate sine, cosine, and raising a number to +a power. There are also some floating-point multiplications, divisions, +additions, and subtractions mixed in. + +The benchmark reports its results as the number of coefficients calculated +per second. + +As an additional note, we should point out that the performance of this +benchmark is heavily dependent on how well-built the compiler's math library +is. We have seen at least two cases where recompilation with new (and +improved!) math libraries have resulted in two-fold and five-fold +performance improvements. (Apparently, when a compiler gets moved to a new +platform, the trigonometric and transcendental functions in the math +libraries are among the last routines to be "hand optimized" for the new +platform.) About all we can say about this is that whenever you run this +test, verify that you have the latest and greatest math libraries. + +Analysis + +Optimized 486 code: The benchmark partitions its time almost evenly among +the modules pow387, exp386, and trig387; giving between 25% and 28% of its +time to each. This is based on profiling with the Watcom compiler running +under Windows NT. These modules hold the routines that handle raising a +number to a power and performing trigonometric (sine and cosine) +calculations. For example, within trig387, time was nearly equally divided +between the routine that calculates sine and the routine that calculates +cosine. + +The remaining time (between 17% and 18%) was spent in the balance of the +test. We noticed that most of that time occurred in the routine +thefunction(). This is at the heart of the numerical integration routine the +benchmark uses. + +Consequently, this benchmark should be a good test of the exponential and +trigonometric capabilities of a processor. (Note that we recognize that the +performance also depends on how well the compiler's math library is built.) + +680x0 Code (Macintosh CodeWarrior): The CodeWarrior profiler is function +based, therefore it is impossible to get performance results for individual +machine instructions. The CodeWarrior compiler is also unable to tell us how +much time is spent within a given library routine; we can't see how much +time gets spent executing the sin(), cos(), or pow() functions (which, +unfortunately, was the whole idea behind the benchmark). + +About all we can glean from the results is that thefunction() takes about +74% of the time in the test (this is where the heavy math calculations take +place) while trapezoidintegrate() accounts for about 26% of the time on its +own. + +Porting Considerations + +Necessarily, this benchmark is at the mercy of the efficiency of the +floating-point support provided by whatever compiler you are using. It is +recommended that, if you are doing the port yourself, you contact the +designers of the compiler, and discuss with them what optimization switches +should be set to produce the fastest code. (This sounds simple; usually it's +not. Some systems let you decide between speed and true IEEE compliance.) + +As far as global definitions go, this benchmark is happily free of them. All +the math is done using double data types. We have noticed that, on some Unix +systems, you must be careful to include the correct math libraries. +Typically, you'll discover this at link time. + +To test for correct execution of the benchmark: It's unlikely you'll need to +do this, since the algorithm is so cut-and-dried. Furthermore, there are no +explicit provisions made to verify the correctness. You can, however, either +dip into your favorite debugger, or alter the code to print out the contents +of the abase (which holds the A[i] terms) and bbase (which holds the B[i] +terms) arrays as they are being filled (see routine DoFPUTransIteration). +** This is exactly what I have done, it now prints out A[i] and B[i] data. +** Uwe F. Mayer <mayer@tux.edu> +Run the benchmark with a command file set to execute only the Fourier test, +and examine the contents of the arrays. The first 100 are listed below. + +A[i]= + 2.84 1.05 0.274 0.0824 0.0102 -0.024 -0.0426 -0.0536 -0.0605 -0.065 +-0.0679 -0.0698 -0.0709 -0.0715 -0.0717 -0.0715 -0.0711 -0.0704 +-0.0696 -0.0685 -0.0674 -0.0661 -0.0647 -0.0632 -0.0615 -0.0598 -0.058 +-0.0561 -0.0542 -0.0521 -0.0501 -0.0479 -0.0457 -0.0434 -0.0411 +-0.0387 -0.0363 -0.0338 -0.0313 -0.0288 -0.0262 -0.0236 -0.0209 +-0.0183 -0.0156 -0.0129 -0.0102 -0.00744 -0.0047 -0.00196 0.000794 +0.00355 0.0063 0.00905 0.0118 0.0145 0.0172 0.0199 0.0226 0.0253 +0.0279 0.0305 0.0331 0.0357 0.0382 0.0407 0.0431 0.0455 0.0479 0.0502 +0.0525 0.0547 0.0569 0.059 0.061 0.063 0.0649 0.0668 0.0686 0.0703 +0.072 0.0736 0.0751 0.0765 0.0779 0.0792 0.0804 0.0816 0.0826 0.0836 +0.0845 0.0853 0.0861 0.0867 0.0873 0.0877 0.0881 0.0884 0.0887 0.0888 + +B[i]= +(undefined) -1.88 -1.16 -0.806 -0.61 -0.487 -0.402 -0.34 -0.293 -0.255 +-0.224 -0.199 -0.177 -0.158 -0.141 -0.126 -0.113 -0.101 -0.0901 +-0.0802 -0.071 -0.0625 -0.0546 -0.0473 -0.0404 -0.034 -0.0279 -0.0222 +-0.0168 -0.0117 -0.00693 -0.00238 0.00193 0.00601 0.00988 0.0135 0.017 +0.0203 0.0234 0.0263 0.0291 0.0317 0.0341 0.0364 0.0385 0.0405 0.0424 +0.0441 0.0457 0.0471 0.0484 0.0496 0.0507 0.0516 0.0525 0.0532 0.0538 +0.0543 0.0546 0.0549 0.055 0.0551 0.055 0.0549 0.0546 0.0543 0.0538 +0.0533 0.0527 0.052 0.0512 0.0503 0.0493 0.0483 0.0472 0.046 0.0447 +0.0434 0.042 0.0405 0.039 0.0374 0.0358 0.0341 0.0323 0.0305 0.0287 +0.0268 0.0249 0.023 0.021 0.019 0.0169 0.0149 0.0128 0.0107 0.00857 +0.00644 0.0043 0.00215 + +Note that there is no B[0] coefficient. If the above numbers are in the +arrays shown, you can feel pretty confident that the benchmark it working +properly. + +References + +Engineering and Scientific Computations in Pascal, Lawrence P. Huelsman, +Harper & Row, New York, 1986. + +Assignment Algorithm + +Description + +This test is built on an algorithm with direct application to the business +world. The assignment algorithm solves the following problem: Say you have X +machines and Y jobs. Any of the machines can do any of the jobs; however, the +machines are sufficiently different so that the cost of doing a particular +job can vary depending what machine does it. Furthermore, the jobs are +sufficiently different that the cost varies depending on which job a given +machine does. You therefore construct a matrix; machines are the rows, jobs +are the columns, and the [i,j] element of the array is the cost of doing the +jth job on the ith machine. How can you assign the jobs so that the cost of +completing them all is minimal? (This also assumes that one machine does one +job.) + +Did you get that? + +The assignment algorithm benchmark is largely a test of how well the +processor handles problems built around array manipulation. It is not a +floating-point test; the "cost matrix" built by the algorithm is simply a 2D +array of long integers. This benchmark considers an iteration to be a run of +the assignment algorithm on a 101 x 101 - element matrix. It reports its +results in iterations per second. + +Analysis + +Optimized 486 code (Watcom C/C++ 10.0): There are numerous loops within the +assignment algorithm. The development system we were using (Watcom C/C++ +10.0) appears to have a fine time unrolling many of them. Consequently, it +is difficult to pin down the execution impact of single lines (as in, for +example, the numeric sort benchmark). + +On the level of functions, the benchmark spends around 70% of its time in +the routine first_assignments(). This is where a) lone zeros in rows and +columns are found and selected, and b) a choice is made between duplicate +zeros. Around 23% of the time is spent in the second_assignments() routine +where (if first_assignments() fails) the matrix is partitioned into smaller +submatrices. + +Overall, we did a tally of instruction mix execution. The approximate +breakdowns are: + +move - 38% + +conditional jump - 12% + +unconditional jump - 11% + +comparison - 14% + +math/logical/shift - 24% + +Many of the move instructions that appeared to consume the most amounts of +time were referencing items on the local stack frame. This required an +indirect reference through EBP, plus a constant offset to resolve the +address. + +This should be a good exercise of a cache, since operations in the +first_assignments() routine require both row-wise and column-wise movement +through the array. Note that the routine could be made more "severe" by +chancing the assignedtableau[][] array to an array of unsigned char -- +forcing fetches on byte boundaries. + +680x0 Code (CodeWarrior): The CodeWarrior profiler is function-based. +Consequently, it's not possible to determine what's going on at the machine +instruction level. We can, however, get a good idea of how much time the +algorithm spends in each routine. The important routines are broken down as +follows: + +calc_minimum_costs() - approximately 0.3% of the time + +(250 microsecs) + +first_assignments() - approximately 79% of the time + +(96284.6 microsecs) + +second_assignments() - approximately 19% of the time + +(22758 microsecs) + +These times are approximate; some time is spent in the Assignment() routine +itself. + +These figures are reasonably close to those of the 486, at least in terms of +the mixture of time spent in a particular routine. Hence, this should still +be a good test of system cache (as described in the preceding section), +given the behavior of the first_assignments() routine. + +Porting Considerations + +The assignment algorithm test is purely an integer benchmark, and requires +no special data types that might be affected by ports to different +architectures. There are only two global constants that affect the +algorithm: + +ASSIGNROWS and ASSIGNCOLS - These set the size of the assignment array. Both +are defined to be 101 (so, the array that is benchmarked is a 101 x 101 +-element array of longs). These values should not be altered. + +To test for correct execution of the benchmark: #define the symbol DEBUG, +recompile, set up a command file that executes only the assignment +algorithm, and run the benchmark. (You may want to pipe the output through a +paging filter, like the more program.) The act of defining DEBUG will enable +a section of code that displays the assigned columns on a per-row basis. If +the benchmark is working properly, the numbers to be displayed +should be: + +R000: 056 R001: 066 R002: 052 R003: 065 R004: 043 R005: 023 R006: 016 +R007: 077 R008: 095 R009: 004 R010: 064 R011: 076 R012: 078 R013: 091 +R014: 013 R015: 029 R016: 044 R017: 014 R018: 041 R019: 042 R020: 020 +R021: 071 R022: 024 R023: 017 R024: 055 R025: 040 R026: 070 R027: 025 +R028: 031 R029: 019 R030: 073 R031: 002 R032: 047 R033: 009 R034: 035 +R035: 045 R036: 005 R037: 063 R038: 081 R039: 039 R040: 087 R041: 008 +R042: 053 R043: 093 R044: 049 R045: 092 R046: 061 R047: 046 R048: 026 +R049: 034 R050: 088 R051: 000 R052: 028 R053: 018 R054: 072 R055: 021 +R056: 037 R057: 082 R058: 006 R059: 058 R060: 096 R061: 068 R062: 069 +R063: 054 R064: 057 R065: 086 R066: 097 R067: 084 R068: 099 R069: 051 +R070: 098 R071: 003 R072: 074 R073: 062 R074: 080 R075: 033 R076: 011 +R077: 094 R078: 012 R079: 050 R080: 010 R081: 038 R082: 089 R083: 059 +R084: 022 R085: 079 R086: 015 R087: 007 R088: 075 R089: 083 R090: 060 +R091: 048 R092: 032 R093: 067 R094: 001 R095: 030 R096: 027 R097: 085 +R098: 090 R099: 036 R100: 100 + +These are the column choices for each row made by the algorithm. If +you see these numbers displayed, the algorithm is working correctly. + +*** The original debugging information was incorrect, as it not only +*** display the chosen columns, but also displayed eliminated columns. +*** Changed to show all 101 entries. Uwe F. Mayer <mayer@tux.edu> + +References + +Quantitative Decision Making for Business, Gordon, Pressman, and Cohn, +Prentice-Hall, Englewood Cliffs, NJ, 1990. + +Quantitative Decision Making, Guiseppi A. Forgionne, Wadsworth Publishing +Co., California, 1986. + +Huffman Compression + +Description + +This is a compression algorithm that -- while helpful for some time as a +text compression technique -- has since fallen out of fashion on account of +the superior performance by algorithms such as LZW compression. It is, +however, still used in some graphics file formats in one form or another. + +The benchmark consists of three parts: + +Building a "Huffman Tree" (explained below), + +Compression, and + +Decompression. + +A "Huffman Tree" is a special data structure that guides the compression and +decompression processes. If you were to diagram one, it would look like a +large binary tree (i.e., two branches per each node). Describing its +function in detail is beyond the scope of this paper (see the references for +more information). We should, however, point out that the tree is built from +the "bottom up"; and the procedure for constructing it requires that the +algorithm scan the uncompressed buffer, building a frequency table for all +the characters appearing in the buffer. (This version of the Huffman +algorithm compresses byte-at-a-time, though there's no reason why the same +principle could not be applied to tokens larger than one byte.) + +Once the tree is built, text compression is relatively straightforward. The +algorithm fetches a character from the uncompressed buffer, navigates the +tree based on the character's value, and produces a bit stream that is +concatenated to the compressed buffer. Decompression is the reverse of that +process. (We recognize that we are simplifying the algorithm. Again, we +recommend you check the references.) + +The Huffman Compression benchmark considers an iteration to be the three +operations described above, performed on an uncompressed text buffer of 5000 +bytes. It reports its results in iterations per second. + +Analysis + +Optimized 486 code (Watcom C/C++ 10.0): The Huffman compression algorithm -- +tree building, compression, and decompression -- is written as a single, +large routine: DoHuffIteration(). All the benchmark's time is spent within +that routine. + +Components of DoHuffIteration() that consume the most time are those that +perform the compression and decompression . + +The code for performing the compression spends most of its time (accounting +for about 13%) constructing the bit string for a character that is being +compressed. It does this by seeking up the tree from a leaf, emitting 1's +and 0's in the process, until it reaches the root. The stream of 1's and 0's +are loaded into a character array; the algorithm then walks "backward" +through the array, setting (or clearing) bits in the compression buffer as +it goes. + +Similarly, the decompression portion takes about 12% of the time as the +algorithm pulls bits out of the compressed buffer -- using them to navigate +the Huffman tree -- and reconstructs the original text. + +680x0 Code (Macintosh CodeWarrior): CodeWarrior's profiler is function +based. Consequently, it's impossible to get performance scores for +individual machine instructions. Furthermore, as mentioned above, the +Huffman compression algorithm is written as a monolithic routine. This makes +the results from the CodeWarrior profiler all the more sparse. + +We can at least point out that the lowmost routines (GetCompBit() and +SetCompBit()) that read and write individual bits, though called nearly 13 +million times each, account for only 0.7% and 0.3% of the total time, +respectively. + +Porting Considerations + +The Huffman algorithm relies on no special data types. It should port +readily. Global constants of interest include: + +EXCLUDED - This is a large, positive value. Currently it is set to 32000, +and should be left alone. Basically, this is a token that the system uses to +indicate an excluded character (one that does not appear in the plain-text). +It is set to a ridiculously high value that will never appear in the +pointers of the tree during normal construction. + +MAXHUFFLOOPS - This is another one of those "governor" constants. The +Huffman benchmark creates more work for itself by doing multiple +compression/decompression loops. This constant sets the maximum number of +loops it will attempt per iteration before it gives up. Currently, it is set +to 50000. Though it is unlikely you'll ever need to modify this value, you +can increase it if your machine is too fast for the adjustment algorithm. Do +not reduce the number. + +HUFFARRAYSIZE - This value sets the size of the plain-text array to be +compressed. You can override this value with the command file to see how +well your machine performs for larger or smaller arrays. The subsequent +results, however, are invalid for comparison with other systems. + +To test for correct execution of the benchmark: #define the symbol DEBUG, +recompile, build a command file that executes only the Huffman compression +algorithm, and run the benchmark. Defining DEBUG will enable a section of +code that verifies the decompression as it takes place (i.e., the routine +compares -- character at a time -- the uncompressed data with the original +plain-text). If there's an error, the program will repeatedly display: "Error +at textoffset xxx". + +** If everything is correct it will emit quite a few "Huffman: OK" messages. +** +** I added a resetting of the random number generator, outside of the +** timed loop, and a resetting of the Huffman tree, inside of the +** timed loop. That should help to make the benchmark more consistent. +** The program did originally only reset half of the tree, which lead +** to runtime errors on some systems. The effect on the benchmark +** should be negligible, and in fact comes out as being of the order +** of less than 1% on my test system. +** Uwe F. Mayer <mayer@tux.edu> + +References + +Data Compression: Methods and Theory, James A. Storer, Computer Science +Press, Rockville, MD, 1988. + +An Introduction to Text Processing, Peter D. Smith, MIT Press, Cambridge, +MA, 1990. + +IDEA Encryption + +Description + +This is another benchmark based on a "higher-level" algorithm; "higher +-level" in the sense that it is more complex than a sort or a search +operation. + +Security -- and, therefore, cryptography -- are becoming increasingly +important issues in the computer realm. It's likely that more and more +machines will be running routines like the IDEA encryption algorithm. (IDEA +is an acronym for the International Data Encryption Algorithm.) + +A good description of the algorithm (and, in fact, the reference we used to +create the source code for the test) can be found in Bruce Schneier's +exhaustive exploration of encryption, "Applied Cryptography" (see +references). To quote Mr. Schneier: "In my opinion, it [IDEA] is the best +and most secure block algorithm available to the public at this time." + +IDEA is a symmetrical, block cipher algorithm. Symmetrical means that the +same routine used to encrypt the data also decrypts the data. A block cipher +works on the plain-text (the message to be encrypted) in fixed, discrete +chunks. In the case of IDEA, the algorithm encrypts and decrypts 64 bits at +a time. + +As pointed out in Schneier's book, there are three operations that the IDEA +uses to do its work: + +XOR (exclusive-or) + +Addition modulo 216 (ignoring overflow) + +Multiplication modulo 216+1 (ignoring overflow). + +IDEA requires a key of 128 bits. However, keys and blocks are further +subdivided into 16-bit chunks, so that any given operation within the IDEA +encryption is performed on 16-bit quantities. (This is one of the many +advantages of the algorithm, it is efficient even on 16-bit processors.) + +The IDEA benchmark considers an "iteration" to be an encryption and +decryption of a buffer of 4000 bytes. The test actually builds 3 buffers: +The first to hold the original plain-text, the second to hold the encrypted +text, and the third to hold the decrypted text (the contents of which should +match that of the first buffer). It reports its results in iterations per +second. + +Analysis + +Optimized 486 code: The algorithm actually spends most of its time (nearly +75%) within the mul() routine, which performs the multiplication modulo +216+1. This is a super-simple routine, consisting primarily of if +statements, shifts, and additions. + +The remaining time (around 24%) is spent in the balance of the cipher_idea() +routine. (Note that cipher_idea() calls the mul() routine frequently; so, +the 24% is comprised of the other lines of cipher_idea()). cipher_idea() is +littered with simple pointer-fetch-and-increment operations, some addition, +and some exclusive-or operations. + +Note that IDEA's exercise of system capabilities probably doesn't extend +beyond testing simple integer math operations. Since the buffer size is set +to 4000 bytes, the test will run entirely in processor cache on most +systems. Even the cache won't get a heavy "internal" workout, since the +algorithm proceeds sequentially through each buffer from lower to higher +addresses. + +680x0 code (Macintosh CodeWarrior): CodeWarrior's profiler is function +based; consequently, it is impossible to determine execution profiles for +individual machine instructions. We can, however, get an idea of how much +time is spent in each routine. + +As with Huffman compression, the IDEA algorithm is written monolithically -- +a single, large routine does most of the work. However, a special +multiplication routine, mul(), is frequently called within each +encryption/decryption iteration (see above). + +In this instance, the results for the 68K system diverges widely from those +of the 486 system. The CodeWarrior profiler shows the mul() routine as +taking only 4% of the total time in the benchmark, even though it is called +over 20 million times. The outer routine is called 600,000 times, and +accounts for about 96% of the whole program's entire time. + +Porting Considerations + +Since IDEA does its work in 16-bit units, it is particularly important that +u16 be defined to whatever datatype provides an unsigned 16-bit integer on +the test platform. Usually, unsigned short works for this. (You can verify +the size of a short by running the benchmarks with a command file that +includes ALLSTATS=T as one of the commands. This will cause the benchmark +program to display a message that tells the size of the int, short, and long +data-types in bytes.) + +Also, the mul() routine in IDEA requires the u32 datatype to define an +unsigned 32-bit integer. In most cases, unsigned long works. + +To test for correct execution of the benchmark: #define the symbol DEBUG, +recompile, build a command file that executes only the IDEA algorithm, and +run the benchmark. Defining DEBUG will enable a section of code that +compares the original plain-text with the output of the test. (Remember, the +benchmark performs both encryption and decryption.) If the algorithm has +failed, the output will not match the input, and you'll see "IDEA Error" +messages all over your display. + +References + +Applied Cryptography: Protocols, Algorithms, and Source Code in C, Bruce +Schneier, John Wiley & Sons, Inc., New York, 1994. + +Neural Net + +Description + +The Neural Net simulation benchmark is based on a simple back-propagation +neural network presented by Maureen Caudill as part of a BYTE article that +appeared in the October, 1991 issue (see "Expert Networks" in that issue). +The network involved is a simple 3-layer (input neurodes, middle-layer +neurodes, and output neurodes) network that accepts a number of 5 x 7 input +patterns and produce a single 8-bit output pattern. + +The test involves sending the network an input pattern that is the 5 x 7 +"image" of a character (1's and 0's -- 1's representing lit pixels, 0's +representing unlit pixels), and teaching it the 8-bit ASCII code for the +character. + +A thorough description of how the back propagation algorithm works is beyond +the scope of this paper. We recommend you search through the references +given at the end of this paper, particularly Ms. Caudill's article, for +detailed discussion. In brief, the benchmark is primarily an exercise in +floating-point operations, with some frequent use of the exp() function. It +also performs a great deal of array references, though the arrays in use are +well under 300 elements each (and less than 100 in most cases). + +The Neural Net benchmark considers an iteration to be a single learning +cycle. (A "learning cycle" is defined as the time it takes the network to be +able to associate all input patterns to the correct output patterns within a +specified tolerance.) It reports its results in iterations per second. + +Analysis + +Optimized 486 code: The forward pass of the network (i.e., calculating +outputs from inputs) utilize a sigmoid function. This function has, at its +heart, a call to the exp() library routine. A small but non-negligible +amount of time is spent in that function (a little over 5% for the 486 +system we tested). + +The learning portion of the network benchmark depends on the derivative of +the sigmoid function, which turns out to require only multiplications and +subtractions. Consequently, each learning pass exercises only simple +floating-point operations. + +If we divide the time spent in the test into two parts -- forward pass and +backward pass (the latter being the learning pass) -- then the test appears +to spend the greatest part of its time in the learning phase. In fact, most +time is spent in the adjust_mid_wts() routine. This is the part of the +routine that alters the weights on the middle layer neurodes. (It accounts +for over 40% of the benchmark's time.) + +680x0 Code (Macintosh CodeWarrior): Though CodeWarrior's profiler is +function based, the neural net benchmark is highly modular. We can therefore +get a good breakdown of routine usage: + +worst_pass_error() - 304 microsecs (called 4680 times) + +adjust_mid_wts() - 83277 microsecs (called 46800 times) + +adjust_out_wts() - 17394 microsecs (called 46800 times) + +do_mid_error() - 11512 microsecs (called 46800 times) + +do_out_error() - 3002 microsecs (called 46800 times) + +do_mid_forward() - 49559 microsecs (called 46800 times) + +do_out_forward() - 20634 microsecs (called 46800 times) + +Again, most time was spent in adjust_mid_wts() (as on the 486), accounting +for almost twice as much time as do_mid_forward(). + +Porting Consideration + +The Neural Net benchmark is not dependent on any special data types. There +are a number of global variables and arrays that should not be altered in +any way. Most importantly, the #defines found in NBENCH1.H under the Neural +Net section should not be changed. These control not only the number of +neurodes in each layer; they also include constants that govern the learning +processes. + +Other globals to be aware of: + +MAXNNETLOOPS - This constant simply sets the upper limit on the number of +training loops the test will permit per iteration. The Neural Net benchmark +adjusts its workload by re-teaching itself over and over (each time it +begins a new training session, the network is "cleared" -- loaded with +random values). It is unlikely you will ever need to modify this constant. + +inpath - This string pointer is set to the path from which the neural net's +input data is read. It is currently hardwired to "NNET.DAT". You shouldn't +have to change this name, unless your file system requires directory +information as part of the path. + +Note that the Neural Net benchmark is the only test that requires an +external data file. The contents of the file are listed in an attachment to +this paper. You should use the attachment to reconstruct the file should it +become lost or corrupted. Any changes to the file will invalidate the test +results. + +To test for correct execution of the benchmark: #define the symbol DEBUG, +recompile, build a command file that executes only the Neural Net test, and +run the benchmark. Defining DEBUG will enable a section of code that +displays how many passes through the learning process were required for the +net to learn. It should learn in 780 passes. + +References + +"Expert Networks," Maureen Caudill, BYTE Magazine, October, 1991. + +Simulating Neural Networks, Norbert Hoffmann, Verlag Vieweg, Wiesbaden, +1994. + +Signal and Image Processing with Neural Networks, Timothy Masters, John +Wiley and Sons, New York, 1994. + +Introduction to Neural Networks, Jeannette Stanley, California Scientific +Software, CA, 1989. + +LU Decomposition + +Description + +LU Decomposition is an algorithm that can be used as the heart of a program +for solving linear equations. Suppose you have a matrix A. LU Decomposition +determines the matrices L and U such that + +L . U = A + +where L is a lower triangular matrix and U is an upper triangular matrix. (A +lower triangular matrix has nonzero elements only on the main diagonal and +below. An upper triangular matrix has nonzero elements only on the main +diagonal and above.) + +Without going into the mathematical details too deeply, having the L and U +matrices makes the solution of linear equations (i.e., equations of the form +A . x = b) quite easy. It turns out that you can also use LU decomposition +to determine matrix inverses and determinants. + +The algorithm used in the benchmarks was derived from Numerical Recipes in +Pascal (there is a C version of the book, which we did not have on hand), a +book we heartily recommend to anyone serious about mathematical and +scientific computing. The authors are approving of LU decomposition as a +means of solving linear equations, pointing out that their version (which +makes use of what we would have to call "Crout's method with partial +implicit pivoting") is a factor of 3 better than one of their Gauss-Jordan +routines, a factor of 1.5 better than another. They go on to demonstrate the +use of LU decomposition for iterative improvement of linear equation +solutions. + +The benchmark begins by creating a "solvable" linear system. This is easily +done by loading up the column vector b with random integers, then +initializing A with an identity matrix. The equations are then "scrambled" +by either multiplying a row by a constant, or adding one row to another. The +scrambled matrices are handed to the LU algorithm. + +The LU Decomposition benchmark considers a single iteration to be the +solution of one set of equations (the size of A is fixed at 101 x 101 +elements). It reports its results in iterations per second. + +Analysis + +Optimized 486 code (Watcom C/C++ 10.0): The entire algorithm consists of two +parts: the LU decomposition itself, and the back substitution algorithm that +builds the solution vector. The majority of the algorithm's time takes place +within the former; the algorithm that builds the L and U matrices (this +takes place in routine ludcmp()). + +Within ludcmp(), there are two extremely tight for loops forming the heart +of Crout's algorithm that consume the majority of the time. The loops are +"tight" in that they each consist of only one line of code; in both cases, +the line of code is a "multiply and accumulate" operation (actually, it's +sort of a multiply and de-accumulate, since the result of the multiplication +is subtracted, not added). + +In both cases, the items multiplied are elements from the A array; and one +factor's row index is varying more rapidly, while another factor's column +index is varying more rapidly. + +Note that this is a good overall test of floating-point operations within +matrices. Most of the math is floating-point; primarily additions, +subtractions, and multiplications (only a few divisions). + +680x0 Code (Macintosh CodeWarrior): CodeWarrior's profiler is function +based. It is therefore impossible to determine execution profiles at the +machine-code level. The profiler does, however, allow us to determine how +much time the benchmark spends in each routine. This breakdown is as +follows: + +lusolve() - 3.4 microsecs (about 0% of the time) + +lubksb() 1198 microsec (about 2% of the time) + +ludcmp() - 63171 microsec (about 91% of the time) + +The above percentages are for the whole program. Consequently, as a portion +of actual benchmark time, the amount attributed to each will be slightly +larger (though the proportions will remain the same). + +Since ludcmp() performs the actual LU decomposition, this is exactly where +we'd want the benchmark to spend its time. The lubksb() routine calls +ludcmp(), using the resulting matrix to "back-solve" the linear equation. + +Porting Considerations + +The LU Decomposition routine requires no special data types, and is immune +to byte ordering. It does make use of a typedef (LUdblptr) that includes an +embedded union; this allows the benchmark to "coerce" a pointer to double +into a pointer to a 2D array of double. This arrangement has not caused +problems with the compilers we have tested to date. + +Other constants and globals to be aware of: + +LUARRAYROWS and LUARRAYCOLS - These constants set the size of the +coefficient matrix, A. They cannot be altered by command file. In fact, you +shouldn't alter them at all, or your results will be invalid. Currently, +they are both set to 101. + +MAXLUARRAYS - This is another "governor" constant. The algorithm performs +dynamic workload adjustment by building more and more arrays to solve per +timing round. This sets the maximum upper limit of arrays that it will +build. Currently, it is set to 1000, which should be more than enough for +the reasonable future (1000 arrays of 101 x 101 floating-point doubles would +require somewhere around 80 megabytes of RAM -- and that's not counting the +column vectors). + +To test for correct execution of the benchmark: Currently, there is no +simple technique for doing this. You can, however, either use your favorite +debugger (or embed a printf() statement) at the conclusion of the lubksb() +routine. When this routine concludes, the array b will hold the solution +vector. These items will be stored as floating-point doubles, and the first +14 are (with rounding): + +46 20 23 22 85 86 97 95 8 89 75 67 6 86 + +If you find these numbers as the first 14 in the array b[], then you're +virtually guaranteed that the algorithm is working correctly. + +*** The above is not correct, as the initial matrix is not the identity, +*** but a matrix with random nonzero entries on the diagonal (they have +*** altered the algorithm since they wrote the documentation). +*** I changed the output of the debugging routine, it now prints first +*** what the array b should hold (as righthand side divided by diagonal +*** entry), and then it prints what the array b does hold after the +*** decomposition has been done to compute the solution of the system. If +*** you get the same, then fine. +*** And, by the way, my original right hand sides are +*** 46 23 85 97 8 75 6 81 88 76 6 84 31 53 2 ... +*** and the diagonal entries are +*** 520 922 186 495 89 267 786 571 175 600 738 321 897 541 859 ... +*** You notice that one has every other number of the original sequence. +*** This is due to BYTE's change of the algorithm, as they now also use the +*** random number generator to generate the diagonal elements. +*** Here is the complete set of data: +*** 46/520=0.09 23/922=0.02 85/186=0.46 97/495=0.20 8/89=0.09 +*** 75/267=0.28 6/786=0.01 81/571=0.14 88/175=0.50 76/600=0.13 +*** 6/738=0.01 84/321=0.26 31/897=0.03 53/541=0.10 2/859=0.00 +*** 86/92=0.93 51/121=0.42 29/248=0.12 51/789=0.06 84/6=14.00 +*** 21/180=0.12 33/48=0.69 2/899=0.00 12/820=0.01 69/372=0.19 +*** 59/809=0.07 74/18=4.11 40/788=0.05 39/56=0.70 86/91=0.95 +*** 33/878=0.04 82/165=0.50 42/561=0.07 8/274=0.03 84/694=0.12 +*** 32/352=0.09 25/969=0.03 59/816=0.07 33/112=0.29 5/125=0.04 +*** 89/740=0.12 7/223=0.03 54/994=0.05 33/80=0.41 55/676=0.08 +*** 6/524=0.01 36/544=0.07 21/160=0.13 58/596=0.10 15/717=0.02 +*** 84/311=0.27 98/530=0.18 46/713=0.06 41/233=0.18 73/640=0.11 +*** 40/343=0.12 72/586=0.12 100/965=0.10 59/764=0.08 37/866=0.04 +*** 27/682=0.04 3/652=0.00 41/352=0.12 87/786=0.11 45/79=0.57 +*** 83/761=0.11 41/817=0.05 46/209=0.22 78/930=0.08 85/210=0.40 +*** 80/756=0.11 18/931=0.02 30/669=0.04 47/127=0.37 85/891=0.10 +*** 66/364=0.18 83/955=0.09 58/637=0.09 58/778=0.07 82/288=0.28 +*** 42/540=0.08 76/290=0.26 59/36=1.64 29/463=0.06 63/476=0.13 +*** 6/340=0.02 73/341=0.21 59/737=0.08 81/492=0.16 98/443=0.22 +*** 58/32=1.81 53/562=0.09 54/263=0.21 46/367=0.13 58/390=0.15 +*** 96/845=0.11 30/746=0.04 2/687=0.00 28/849=0.03 84/180=0.47 +*** 85/382=0.22 +*** Uwe F. Mayer <mayer@tux.edu> + +References + +Numerical Recipes in Pascal: The Art of Scientific Computing, Press, +Flannery, Teukolsky, Vetterling, Cambridge University Press, New York, 1989. |