The 64-bit Question: AMD64 vs. i386
Written by Jem Matzan   
Dec 09, 2004 at 06:10 PM

There have been many reviews and articles about the AMD64 architecture, but thus far there have been no performance reviews that show the advantage of 64-bit. Likewise there have been no comparisons between AMD's new 64-bit processors and Intel's Pentium4 Prescott core processors when taking full advantage of their technologies. Why do people benchmark the Athlon64 and Opteron in i386 mode? Why do people benchmark the P4 processor with Hyper-Threading turned off? When people buy or build systems with these technologies, they expect to use them -- and they want to know how they compare to one another. Today all of that will change; I have the world's first 32- and 64-bit performance comparison between the Athlon64 3200+ and the Pentium4 3.2E.

Introduction/Disclaimer

To begin with, there are only a handful of solid conclusions that can be drawn from this benchmarking project, and those are listed at the end in the conclusions section. You must remember that the numbers printed in this review are only good for comparing performance in FreeBSD; other operating systems are almost guaranteed to have different ways of optimizing code and could suggest different conclusions about the raw performance of the hardware in question. So what I'm saying is, this information is true in FreeBSD 5.2.1-RELEASE with the stated hardware configurations. Any deviation from this configuration or software could reveal different results that sway superiority in a different direction, although my experience tells me that such vastly different data would be highly unusual. In general, you can probably assume that the data in this article will be indicative of performance on any operating system or with any similar hardware using identical technologies, but I can't guarantee that. Only further testing with different software will reveal the entire picture.

In this and several other benchmarking projects done with a variety of software in different operating systems I have never seen any performance advantage to turning off Hyper-Threading Technology in a Pentium4 processor. I don't know where the rumors of weaker performance with HT are coming from -- and I don't care (please don't email me with links to other people's reviews of synthetic tests in Windows) -- but you should give them little or no credence, or at the very least be skeptical of them.

The Hardware

For this article I acquired the following hardware to use for two systems. They shared the same optical drive, hard drive, video card, RAM, cables, power supply, and chassis. Only the motherboard and CPU were changed to switch from the AMD machine to the Intel machine. This was done to prevent variations that could be caused by hardware manufacturing flaws or differences in output due to brand.

  • Asus K8V Deluxe
  • AMD Athlon64 3200+
  • Thermaltake K8 Silent Boost HSF
  • Intel D875PBZ (rev. 301)
  • Intel Pentium4 3.2E
  • Corsair PC3200 TwinX-LL 1024MB kit
  • Western Digital 36GB Raptor SATA hard drive
  • Sony DDU1621 DVD-ROM
  • ATI Radeon 9800Pro All-In-Wonder 128MB
  • Antec TrueBlue 480W power supply
  • Skyhawk Galaxy case with front, rear, side, and cowl fans

The CPUs were sent to me by their respective manufacturers. The AMD processor is a standard OEM edition available to computer manufacturers and through some retail channels. The Intel processor is a "Confidential" series pre-release sample. I know I said I would never write a review of pre-release hardware, but that was in reference to motherboards, which can often perform better than their future retail box incarnation (for example the Asus P4S8X and all of the Intel E7205-based motherboard pre-releases were significantly better performers than the final release edition). My experience with pre-release CPUs (of which I have seen and tested four) is that they are always of the same technology as the final retail or OEM editions. It's one thing to make a super-powered motherboard for the press; it's entirely another thing to modify a CPU for the same nefarious purposes. To be certain, I asked my press contact at Intel if there have been any changes since the Confidential pre-release samples, and he said that they were technologically identical to the consumer release, with the exception of errata corrections, stepping numbers and the multiplier lock (meaning I can change the clock multiplier in the BIOS, a feature which is disabled in the consumer release to discourage overclocking). I trust Intel's word on the matter because my experience with them has indicated that they're not liars when it comes to issues like this. Just the same, the possibility exists that my results can be different than those obtained with a retail box P4 3.2E, but if that is the case the numbers would likely be in favor of the retail processor, not the pre-release sample. Again, we're talking about a possibility (in a random Universe, anything is possible) attached to a small likelihood of a significant performance difference.

The heatsink/fan (HSF) unit that I used for the Intel processor came from Intel. It's a modified version of the traditional socket478 fan, except it has a larger copper core than the previous edition and the fins on the heatsink are in a sort of star pattern. The locking mechanism is the same. The heatsink compound I used was already on the bottom of the heatsink in a small gray pad. Intel provided a syringe of extra compound, but I didn't have cause to use it.

The Thermaltake K8 Silent Boost is an excellent solid-copper HSF. It's a good thing, too -- it was my only choice. It seems there aren't (or weren't when I bought this unit two months ago) many manufacturers that make HSFs for AMD64 processors. For the Athlon64 processor I used the standard-issue white heatsink compound, which is verified and certified by AMD.

The RAM was sent to me by Corsair for this and other benchmarking projects. It is the same retail box kit that you can buy through any authorized reseller. I could have requested RAM from a number of other manufacturers, but I chose Corsair for its high level of compatibility with motherboards, its low latency and reliable performance.

The WD Raptor is the fastest SATA drive on the market, and I acquired it for this and other benchmarking projects.

The Radeon 9800Pro AIW was sent to me by ATI for a previous review of SciTech's SNAP Graphics drivers. I chose it for this review because it is a reasonable choice for a high-end single CPU workstation, and because although I didn't do any graphics testing in this review, I plan on doing several graphic-intensive benchmarking projects in the future and I will be using this card for those tests. It helps to keep things as standard as possible to maintain cross-compatibility with my reviews.

The Antec TrueBlue 480 is both quiet and powerful. I had a long internal debate over whether I should get this power supply or the Vantec Stealth 420. Both are excellent supplies, but in the end I decided that the slightly cheaper Antec would be a better choice for this project because of its automatic fan control (the Vantec has a manual switch on the back) and its higher voltage and amperage ratings. The blue LED is superfluous -- you can hardly see it when the case covers are on.

The Skyhawk case was a poor choice, but it's all I had available to me. It is not FCC approved because of the acrylic window in the side, although I eliminated that variable by leaving the side cover off for all of my testing. This was also to improve ventilation and maintain a more consistent temperature inside the system. This case is totally unsuitable for a system based on the Prescott core because of its high operating temperature; despite all of the fans it has in it, ten seconds of idle operation with the side cover on forced the CPU fan to speeds of nearly 5000RPM.

Each system was assembled with care and all wires and connectors were correctly connected according to the manual. The BIOS was adjusted as necessary and the RAM sticks were in the proper slots for best performance. I decided to conduct my tests in a real computer chassis because, oddly, no other reviewers seem to do that. They bare-board everything, which means that they will never discover problems intrinsic to chassis assembly, such as the trouble I had with the Prescott system's fan noise. This benchmarking project was designed to closely mimic a real workstation system, not a fictional lab testing environment.

The Software

The operating system I used was FreeBSD 5.2.1-RELEASE. If you'd like to learn more about how I configured the operating system and how I devised my benchmarking methods, or if you'd like to learn how to benchmark hardware using FreeBSD, I've written a separate article about it here.

I used the standard Unix time command to conduct stopwatch tests, stream and ubench for synthetic tests, and OpenSSL, oggenc, and cdparanoia for my real-world tests. I did not conduct any testing in X -- that would be a totally separate review, and the research and testing for it have already begun.

I did not generate statistics because it's not easy to do with three test cases and I felt that they were unnecessary anyway. I also didn't make any graphs to show differences in performance. If you want to see pretty graphs that mislead readers and suggest flawed conclusions, you'll be disappointed with this review. You shouldn't need a graph or chart to put this data in perspective anyway -- it's pretty straightforward.

I tested everything with both the ULE and the 4BSD scheduler. Overall I found the 4BSD scheduler to be measurably faster in most cases. This is in contrast to the general belief that ULE is "faster" than 4BSD. To include a comparison of the schedulers would make this article a bit too long for most people's tastes, so I wrote a separate review for the scheduler comparison, and it can be found here. For the purpose of this article I chose to use the numbers from the 4BSD scheduler because they were generally faster for all three test cases and because the 4BSD scheduler is more mature and more likely to be used in a production environment (or even for regular desktop use).

Stopwatch Tests

All times are listed in seconds. Each number is the mean of three distinct test iterations, reported separately for the real time (the total elapsed time), the user time (the time it takes to execute the utility), and the system overhead time.
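The averaging described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used for this article: the function names parse_time_line and mean_times are hypothetical, the three sample lines are made-up values, and the input format is assumed to be the one-line real/user/sys summary that the BSD /usr/bin/time utility prints (a shell's built-in time command may format its output differently).

```python
import re
from statistics import mean

# Assumed input: the one-line summary printed by BSD /usr/bin/time, e.g.
#   "  153.19 real   82.38 user   55.84 sys"
TIME_LINE = re.compile(r"([\d.]+)\s+real\s+([\d.]+)\s+user\s+([\d.]+)\s+sys")


def parse_time_line(line):
    """Return (real, user, sys) in seconds from one summary line."""
    match = TIME_LINE.search(line)
    if match is None:
        raise ValueError(f"not a time summary line: {line!r}")
    return tuple(float(value) for value in match.groups())


def mean_times(lines):
    """Average real, user and sys separately across several runs."""
    runs = [parse_time_line(line) for line in lines]
    real, user, sys_ = (mean(column) for column in zip(*runs))
    return {"real": real, "user": user, "sys": sys_}


# Hypothetical values for three iterations of the same test.
sample_runs = [
    "  120.00 real   72.00 user   33.00 sys",
    "  121.50 real   73.50 user   33.30 sys",
    "  121.50 real   73.50 user   32.70 sys",
]
averages = mean_times(sample_runs)
print({name: round(value, 2) for name, value in averages.items()})
```

Keeping the three columns separate matters because, as the tables below show, real time and CPU time can point in different directions.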

It's simple: I timed how long it took to compile the base system with varying numbers of concurrent processes. I also compiled Apache version 2.0.48_3 using no concurrent processes. The latter is a much shorter test and the code is the same for both architectures, whereas there is some debate over whether it is fair to compare an AMD64 buildworld time and an i386 buildworld time because of the variation in the code. According to David O'Brien -- the FreeBSD developer in charge of the AMD64 architecture -- about 80% of the AMD64 instruction set is the i386 instruction set, and most of the code generation code is shared between the two, so if there's a difference it isn't a big one. But the code is different and thus it isn't fair to directly compare the times.

By themselves these compile times are not a reliable indicator of performance because of the dependence on the compiler (GCC, in this case), but when viewed as part of a benchmarking project these results have their place. After all, compile times are very important to a lot of people regardless of the reasons why the numbers are different. So keep in mind that we're benchmarking GCC (version 3.2.1) performance as well as system performance with these tests. The AMD64 code in GCC of course is not nearly as mature as the i386 code, so right off the bat the Athlon64/AMD64 times are going to be slower than they could be.

For the Apache2 build test I built the port and let it download and install all of the necessary dependencies. I then uninstalled Apache2 only -- leaving the dependencies in place and the downloaded source code in the distfiles directory -- and restarted in single-user mode, where Apache2 was rebuilt and timed. The time includes clean time; the exact command was time make install clean

Buildworld Real Time
Concurrent Processes   Pentium4 Time   Athlon64/i386 Time   Athlon64/AMD64 Time   Fastest?
-j2                    2346.29         2290.25              2132.81               Athlon64/AMD64
-j3                    2096.52         2096.63              2019.88               Athlon64/AMD64
-j4                    1999.00         2016.37              1965.46               Athlon64/AMD64

The User and System times are not as important as the Real time; the Real time is the total time elapsed for the test, and it's really the only time that matters. But the other scores show some interesting results:

Buildworld User Time
Concurrent Processes   Pentium4 Time   Athlon64/i386 Time   Athlon64/AMD64 Time   Fastest?
-j2                    2221.36         1420.85              1435.54               Athlon64/i386
-j3                    2365.02         1427.53              1438.23               Athlon64/i386
-j4                    2416.14         1427.68              1439.45               Athlon64/i386

Buildworld System Time
Concurrent Processes   Pentium4 Time   Athlon64/i386 Time   Athlon64/AMD64 Time   Fastest?
-j2                    408.69          246.61               372.80                Athlon64/i386
-j3                    445.87          263.92               389.79                Athlon64/i386
-j4                    465.16          272.46               396.17                Athlon64/i386

The Pentium4 times appear to be impossible; according to the numbers for -j3 and -j4, it takes longer to execute the utility than it does to complete the entire process. This is due to Hyper-Threading: there are two virtual CPUs, and the user and system times are accumulated across both of them, so their total can exceed the elapsed real time. Notice also that the Pentium4 isn't significantly faster than the Athlon64/i386 despite the advantage of Hyper-Threading. This could be due to the vastly different design (longer pipeline, more latency, smaller pathways, etc. -- see my article on Prescott technology for details) in the Prescott core.
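To make the oddity concrete, here is a small Python sketch that divides the CPU time (user plus system) by the real time for every buildworld result in the tables above. The function name cpu_to_real_ratio is my own, and the interpretation is an assumption: a ratio above 1.0 can only happen when more than one logical CPU is accumulating time at once.

```python
# Mean buildworld times in seconds, copied from the tables above.
# Each entry maps a -j level to (real, user, system).
buildworld = {
    "Pentium4": {
        2: (2346.29, 2221.36, 408.69),
        3: (2096.52, 2365.02, 445.87),
        4: (1999.00, 2416.14, 465.16),
    },
    "Athlon64/i386": {
        2: (2290.25, 1420.85, 246.61),
        3: (2096.63, 1427.53, 263.92),
        4: (2016.37, 1427.68, 272.46),
    },
    "Athlon64/AMD64": {
        2: (2132.81, 1435.54, 372.80),
        3: (2019.88, 1438.23, 389.79),
        4: (1965.46, 1439.45, 396.17),
    },
}


def cpu_to_real_ratio(real, user, system):
    """CPU seconds charged per elapsed second; above 1.0 implies parallel CPUs."""
    return (user + system) / real


for machine, runs in buildworld.items():
    for jobs, (real, user, system) in sorted(runs.items()):
        ratio = cpu_to_real_ratio(real, user, system)
        print(f"{machine:15s} -j{jobs}: (user + sys) / real = {ratio:.2f}")
```

Only the Pentium4 rows come out above 1.0; both Athlon64 configurations stay below it at every -j level.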

Apache 2 Real Time
Pentium4 Time   Athlon64/i386 Time   Athlon64/AMD64 Time   Fastest?
153.19          121.09               137.91                Athlon64/i386

Apache 2 User Time
Pentium4 Time   Athlon64/i386 Time   Athlon64/AMD64 Time   Fastest?
82.38           73.04                78.93                 Athlon64/i386

Apache 2 System Time
Pentium4 Time   Athlon64/i386 Time   Athlon64/AMD64 Time   Fastest?
55.84           33.10                50.99                 Athlon64/i386

It's worth noting that there is a reproducible anomaly in the data for the Athlon64/i386 in which the ULE scheduler provides faster scores than the 4BSD scheduler in all cases. If I were to take the best scores between the two schedulers for each test, the Athlon64/i386 would be about two seconds faster than the Pentium4 in the buildworld -j4 test. Making that comparison between schedulers is not a fair test, however, because the test conditions for all machines would not be equal.

What you can't see with the averages I've put in above is that the Intel system has a larger degree of fluctuation in the times, whereas the AMD system in either mode produces far more consistent results. The times were particularly wacky (as much of a deviation as 8 seconds) as more concurrent processes were added to the buildworld test. It's my belief that the abnormally high operating temperature of the Pentium4 Prescott core (60 degrees C at idle) is to blame for its inconsistency, although there could also be some mistakes in the still-not-ready-for-primetime SMPng code that are providing suboptimal performance in a Hyper-Threaded system. The CPU temperature noticeably fluctuates under load when in a room temperature environment. Although I can't measure it while running tests, I can hear the CPU fan increase in speed over the length of the test. I would guess that it increases by roughly 1000 RPM or more, judging by the sound.
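For readers who want to quantify that kind of fluctuation in their own runs, a minimal Python sketch follows. The per-run times below are hypothetical (the individual runs are not printed in this article), and the helper name spread is my own; the only thing borrowed from the text is the idea of an 8-second gap between the fastest and slowest run.

```python
from statistics import mean, stdev


def spread(times):
    """Summarize run-to-run variation for a list of times in seconds."""
    return {
        "mean": mean(times),
        "range": max(times) - min(times),  # slowest run minus fastest run
        "stdev": stdev(times),             # sample standard deviation
    }


# Hypothetical real times for three runs of the same test on two machines.
steady = [2016.0, 2016.5, 2017.0]
jittery = [1995.0, 1999.0, 2003.0]

for label, times in (("steady", steady), ("jittery", jittery)):
    summary = spread(times)
    print(f"{label:8s} mean={summary['mean']:.2f} "
          f"range={summary['range']:.2f} stdev={summary['stdev']:.2f}")
```

With only three runs per test the standard deviation is a rough guide at best, which is why the simple range is reported next to it.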

Synthetic Benchmarks

Synthetic tests can reveal information that we might not otherwise be able to obtain, but in general you should not put a lot of stock in them. If they were accurate I would think them more useful as a method of determining performance, but their numbers are not consistent with the other results from stopwatch and real-world testing. Part of the reason why I am printing these results is to show just how misleading a synthetic test can be -- take heed all ye synthetic benchmark monkeys.

Comparing synthetic benchmark numbers is only meaningful when comparing them with the same synthetic benchmarks with the same configuration. The memory bandwidth test is actually useful, but I'm sure it wasn't configured properly when I used it.

Stream measures memory bandwidth, although I don't trust it. I tried to contact the author of the benchmark to get his help in properly configuring this software, but after one email he abandoned me. The Stream website won't accept a connection from my machine (or are they blocking my ISP? Or is it IE-only? I'll never know), the Google cache didn't reveal the answers I needed and I had no access to reasonable documentation. The numbers below are the bandwidth rates in megabytes per second. I already know from testing these systems with SiSoft Sandra in 32-bit Windows that these numbers are not accurate (they seem to be exactly 1000 MB/sec too low), but they appear to be uniformly inaccurate as far as I can tell, so they are still useful as a point of comparison. Please note that I modified the Makefile according to the only directions on Stream that I could find. Please see the aforementioned FreeBSD benchmarking article for details.

Stream (MB/sec)
Test    Pentium4   Athlon64/i386   Athlon64/AMD64   Fastest?
Copy    3413.33    2048            2048             Pentium4
Scale   3413.33    2048            2048             Pentium4
Add     3072       2194.29         1920             Pentium4
Triad   3072       1920            1920             Pentium4

As far as I can tell from other testing, the Pentium4 should indeed have higher bandwidth because of the higher core clock speed and the dual-channel memory controller, although the Athlon64 should have lower latency because of the on-die memory controller. There was no need to average the results of all three test runs because the results were identical each time for all three systems.
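As a rough sanity check on those figures, the theoretical peak bandwidth of PC3200 memory can be worked out and set against the Stream Copy results. This is a back-of-the-envelope sketch under stated assumptions: PC3200 performs 400 million transfers per second over a 64-bit (8-byte) channel, the Pentium4 board runs two channels, and the Athlon64 board is assumed to run a single channel, which is not spelled out in this article. The function name peak_bandwidth_mb_s is my own.

```python
def peak_bandwidth_mb_s(mega_transfers_per_s, bus_width_bytes, channels):
    """Theoretical peak memory bandwidth in MB/sec."""
    return mega_transfers_per_s * bus_width_bytes * channels


PC3200_MEGA_TRANSFERS = 400  # DDR400: 200 MHz clock, two transfers per cycle
CHANNEL_WIDTH_BYTES = 8      # one 64-bit memory channel

# Stream Copy results from the table above, paired with the assumed channel count.
systems = {
    "Pentium4": {"channels": 2, "stream_copy": 3413.33},
    "Athlon64": {"channels": 1, "stream_copy": 2048.0},  # single channel is an assumption
}

for name, info in systems.items():
    peak = peak_bandwidth_mb_s(PC3200_MEGA_TRANSFERS, CHANNEL_WIDTH_BYTES, info["channels"])
    fraction = info["stream_copy"] / peak
    print(f"{name:9s} peak={peak} MB/sec  measured={info['stream_copy']} MB/sec  "
          f"({fraction:.0%} of peak)")
```

Neither system gets close to its theoretical ceiling in these numbers, which fits my suspicion that Stream was not configured properly.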

The ubench test is an old Unix benchmark test, kind of the "old standby," like 3DMark in the Windows world. It is absolutely meaningless -- it produces a number that rates the CPU and RAM performance, and that number is only useful when comparing it to other ubench numbers.

It seems to be quite buggy, as I never once got it to complete its testing procedure. It would do the CPU test and then exit on a signal 6 (in 64-bit mode) or signal 11 (in i386 mode) when doing the memory test. Although it compiles for AMD64, it doesn't produce the kind of result that one would expect (based on the other tests in this project) in 64-bit mode.

Ubench CPU
Pentium4   Athlon64/i386   Athlon64/AMD64   Fastest?
137239     97348           71352            Pentium4

Now obviously that number for Athlon64/AMD64 is not accurate, but unfortunately you don't discover these kinds of problems until after the testing is over with. Still, maybe it'll help the ubench people fix their bugs.

Real-World Tests

This is the most useful of all of the data I collected because it shows how a system will perform in real-world scenarios. I didn't test a lot of different programs here because many of the tests that I think would be best must be performed in X11. I tried ripping a CD with cdparanoia, but the results were too close to say that there was a meaningful difference between systems. I attribute that to the CPUs being able to handle more work than the optical drive could provide for them. Just in case you're curious, the CD I tested with is LA Woman by The Doors, and it took roughly 660 seconds to rip it to the hard drive. From there I encoded the tracks with oggenc from the vorbis-tools port. The times below are, as above, listed in seconds and they represent mean averages from three separate testing runs. The exact command was time oggenc * and it was run in a directory containing only the ripped WAV files from the cdparanoia test.

Oggenc Real Time
Pentium4 Time   Athlon64/i386 Time   Athlon64/AMD64 Time   Fastest?
249.64          258.54               170.10                Athlon64/AMD64

Again, the User and System times are not as important as the Real time listed above; the Real time is the total time elapsed for the test, and it's really the only time that matters.

Oggenc User Time
Pentium4 Time   Athlon64/i386 Time   Athlon64/AMD64 Time   Fastest?
247.87          256.46               168.35                Athlon64/AMD64

Oggenc System Time
Pentium4 Time   Athlon64/i386 Time   Athlon64/AMD64 Time   Fastest?
1.05            0.94                 0.67                  Athlon64/AMD64

There is quite a big difference between the 64-bit and 32-bit times, as you can see. It is abundantly clear that for encoding with oggenc, 64-bit makes a very significant difference in performance. Although I didn't test other encoders, I shall put forth the hypothesis that all audio and video encoding (and probably decoding) would be fastest (among the given machines in this review) with the Athlon64 in 64-bit mode.
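The size of that difference is easy to put a number on. The short Python sketch below uses the oggenc real times from the table above; the helper names speedup and percent_less_time are my own.

```python
# Mean oggenc real times in seconds, copied from the table above.
oggenc_real = {
    "Pentium4": 249.64,
    "Athlon64/i386": 258.54,
    "Athlon64/AMD64": 170.10,
}


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate finished than the baseline."""
    return baseline_seconds / candidate_seconds


def percent_less_time(baseline_seconds, candidate_seconds):
    """Percentage of the baseline's elapsed time that the candidate saved."""
    return (1 - candidate_seconds / baseline_seconds) * 100


amd64 = oggenc_real["Athlon64/AMD64"]
for machine in ("Pentium4", "Athlon64/i386"):
    base = oggenc_real[machine]
    print(f"Athlon64/AMD64 vs {machine}: {speedup(base, amd64):.2f}x faster, "
          f"{percent_less_time(base, amd64):.1f}% less time")
```

On the same Athlon64 hardware, switching from i386 to AMD64 mode makes the encode roughly one and a half times faster, cutting the elapsed time by about a third.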

Lastly I used OpenSSL from the FreeBSD base system as a test (click here for the OpenSSL documentation). The output was piped to a text file for each run. The exact command used was openssl speed >run1.txt replacing the number in the text file name to correspond with the number of the test run.

I selected the first run from each 4BSD testing group for publication. There were some variances, but I didn't notice a significant difference in the results and there was no way I was going to take the time to do averages on all of these numbers.

Pentium4 3.2E

OpenSSL 0.9.7c 30 Sep 2003
built on: Mon Mar 1 21:51:43 EST 2004
options:bn(64,32) md2(int) rc4(idx,int) des(ptr,risc1,16,long) aes(partial) blowfish(idx)
compiler: cc
available timing options: USE_TOD HZ=128 [sysconf value]
timing function used: getrusage
The 'numbers' are in 1000s of bytes per second processed.

type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
md2 2016.84k 4322.17k 6070.05k 6751.71k 6975.67k
mdc2 5187.62k 6309.10k 6663.01k 6842.96k 6876.25k
md4 14902.30k 56885.19k 164041.24k 360880.61k 557337.75k
md5 13533.88k 46715.49k 127714.46k 242926.90k 329674.50k
hmac(md5) 7378.81k 27055.24k 85550.47k 197721.01k 317374.48k
sha1 11164.95k 30949.10k 69059.28k 101680.00k 117127.47k
rmd160 9379.24k 26992.07k 53580.32k 78348.85k 89972.69k
rc4 131738.10k 147298.92k 149236.43k 150199.64k 150488.26k
des cbc 29414.18k 31502.57k 33198.69k 33733.32k 34911.05k
des ede3 7834.13k 8009.96k 8034.31k 7928.99k 7964.91k
idea cbc 0.00 0.00 0.00 0.00 0.00
rc2 cbc 23213.44k 24324.99k 24457.96k 24481.34k 24506.93k
rc5-32/12 cbc 117025.06k 136453.15k 135906.90k 136299.75k 136531.75k
blowfish cbc 63282.39k 65804.80k 66005.56k 66382.67k 66462.08k
cast cbc 53759.70k 58726.38k 59652.19k 60284.82k 60165.25k
aes-128 cbc 79154.59k 78949.15k 81544.36k 81947.53k 82485.71k
aes-192 cbc 70636.01k 70085.72k 71894.33k 72926.05k 73213.11k
aes-256 cbc 62936.16k 63083.62k 64930.03k 65524.64k 65603.78k
sign verify sign/s verify/s
rsa 512 bits 0.0012s 0.0001s 819.1 8892.4
rsa 1024 bits 0.0068s 0.0003s 147.2 3016.8
rsa 2048 bits 0.0400s 0.0011s 25.0 945.4
rsa 4096 bits 0.2531s 0.0038s 4.0 264.0
sign verify sign/s verify/s
dsa 512 bits 0.0011s 0.0013s 926.0 795.7
dsa 1024 bits 0.0032s 0.0038s 311.3 264.5
dsa 2048 bits 0.0103s 0.0122s 97.0 82.1

Athlon64/i386

OpenSSL 0.9.7c 30 Sep 2003
built on: Mon Mar 1 21:51:43 EST 2004
options:bn(64,32) md2(int) rc4(idx,int) des(ptr,risc1,16,long) aes(partial) blowfish(idx)
compiler: cc
available timing options: USE_TOD HZ=128 [sysconf value]
timing function used: getrusage
The 'numbers' are in 1000s of bytes per second processed.

type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
md2 1638.38k 3478.62k 4841.21k 5365.51k 5537.52k
mdc2 4254.16k 4813.89k 4961.38k 5002.23k 5012.39k
md4 14880.30k 51939.53k 148904.64k 278367.22k 375362.23k
md5 11737.69k 38977.26k 103199.60k 172939.25k 215400.27k
hmac(md5) 6663.99k 23743.44k 72134.23k 146192.86k 209540.45k
sha1 12623.98k 37535.03k 85321.96k 125001.94k 144802.48k
rmd160 9267.16k 26074.68k 54288.80k 74601.53k 83762.21k
rc4 104211.17k 109806.94k 111064.92k 112103.17k 111971.22k
des cbc 21634.75k 22290.56k 22437.69k 22491.36k 22506.44k
des ede3 7844.08k 7898.53k 7924.55k 7931.19k 7932.89k
idea cbc 0.00 0.00 0.00 0.00 0.00
rc2 cbc 16947.92k 17566.40k 17732.03k 17761.42k 17877.29k
rc5-32/12 cbc 90145.55k 105133.21k 109166.02k 110619.16k 111020.08k
blowfish cbc 53110.73k 57511.04k 58590.09k 58910.42k 59084.57k
cast cbc 35109.22k 37163.85k 37530.39k 37802.93k 37861.76k
aes-128 cbc 46546.88k 46812.91k 47449.30k 47734.63k 47655.72k
aes-192 cbc 40101.23k 40735.85k 41216.97k 41339.27k 41372.50k
aes-256 cbc 35840.54k 36149.35k 36431.92k 36527.26k 36553.03k
sign verify sign/s verify/s
rsa 512 bits 0.0009s 0.0001s 1134.5 12481.6
rsa 1024 bits 0.0048s 0.0003s 207.3 3978.1
rsa 2048 bits 0.0304s 0.0008s 32.9 1180.0
rsa 4096 bits 0.2016s 0.0029s 5.0 342.4
sign verify sign/s verify/s
dsa 512 bits 0.0008s 0.0009s 1277.2 1099.3
dsa 1024 bits 0.0025s 0.0030s 395.7 330.4
dsa 2048 bits 0.0083s 0.0102s 120.8 98.5

Athlon64/AMD64

OpenSSL 0.9.7c 30 Sep 2003
built on: Sun Mar 7 16:43:49 EST 2004
options:bn(64,64) md2(int) rc4(ptr,int) des(ptr,risc2,4,int) aes(partial) blowfish(idx)
compiler: cc
available timing options: USE_TOD HZ=128 [sysconf value]
timing function used: getrusage
The 'numbers' are in 1000s of bytes per second processed.

type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
md2 1204.23k 2571.24k 3592.71k 3988.41k 4121.45k
mdc2 5161.87k 6297.17k 6657.93k 6762.99k 6798.87k
md4 11923.81k 39554.35k 111188.27k 202296.83k 266392.58k
md5 9601.32k 29942.05k 80814.26k 139825.29k 178806.72k
hmac(md5) 5180.13k 18441.17k 56860.14k 118549.58k 173822.73k
sha1 10504.36k 32097.17k 82315.95k 134920.10k 166304.83k
rmd160 8125.53k 23455.99k 51355.48k 73436.98k 83672.80k
rc4 133123.48k 137170.19k 140310.94k 141107.26k 141285.73k
des cbc 38653.14k 39868.02k 40403.14k 40538.90k 40579.26k
des ede3 15100.59k 15479.58k 15569.25k 15591.39k 15598.30k
idea cbc 0.00 0.00 0.00 0.00 0.00
rc2 cbc 18303.26k 18731.89k 18895.34k 18865.81k 18888.15k
rc5-32/12 cbc 95149.75k 104157.93k 106357.28k 107218.10k 107529.66k
blowfish cbc 67697.27k 72494.70k 73492.28k 73912.48k 74057.74k
cast cbc 49498.75k 52099.27k 52691.71k 52933.53k 53014.57k
aes-128 cbc 83711.88k 86110.78k 87626.25k 88013.83k 88133.09k
aes-192 cbc 75244.86k 76571.56k 77698.76k 77985.74k 78074.84k
aes-256 cbc 67943.89k 68785.95k 69749.08k 69994.72k 70070.49k
sign verify sign/s verify/s
rsa 512 bits 0.0003s 0.0000s 2955.1 31135.6
rsa 1024 bits 0.0011s 0.0001s 913.8 13670.3
rsa 2048 bits 0.0061s 0.0002s 162.6 4928.5
rsa 4096 bits 0.0393s 0.0006s 25.5 1572.3
sign verify sign/s verify/s
dsa 512 bits 0.0002s 0.0003s 4664.3 3882.3
dsa 1024 bits 0.0005s 0.0007s 1825.3 1500.3
dsa 2048 bits 0.0017s 0.0020s 602.5 500.5

You might notice that the Pentium4 and the Athlon64/i386 tests are using the same build of OpenSSL. That's because I didn't reinstall or recompile the operating system when switching between the two systems, and they used the same /etc/make.conf file. Since both had all of the necessary functions to qualify them as an x86 P4 design and all of the proper driver modules would be detected just the same, I saw no reason to reinstall or rebuild the OS. This isn't Windows XP -- you don't have to reinstall every time you change arch-compatible hardware, and the kernel isn't solidified after the initial install.

This also brings up an important point: the make command does not yet have any special accommodation for SSE3 (Prescott New Instructions) or for the AMD64 ISA, so both the Pentium4 and the Athlon64/AMD64 numbers are lower than they could theoretically be with future compiler optimizations. The /etc/make.conf for the Athlon64/AMD64 used no CPU flags (that means none were put into make.conf; it does not mean that I used the NOCPUFLAGS option) because it's technically a different ISA. This was on the advice of David O'Brien of the FreeBSD team via the AMD64 mailing list.

OpenSSL in FreeBSD has hand-optimized assembler code for i386 and will therefore be more favorable to the Pentium4 in this test scenario when comparing certain ciphers (see the update heading below for details); the AMD64 code is all in C, so the code will perform differently. According to FreeBSD developers it is possible to optimize the code for AMD64 in the same way, but it would increase the clutter of the base system and it's a matter of debate whether or not it should be done.

I couldn't think of a good way to use a table to compare all of the data from the OpenSSL tests, so you'll just have to compare the numbers by looking back and forth. Or if you want a quicker answer, the Pentium4 is significantly faster than the Athlon64/AMD64 and slightly faster than the Athlon64/i386 in the smaller algorithms, but the Athlon64/AMD64 blows the other two out of the water in the RSA and DSA public-key signing and verification tests, with the Athlon64/i386 in second and the Pentium4 in a distant third.
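One way to line up the public-key results side by side is to parse the rsa and dsa lines out of the three listings and compare the sign/s column. The Python sketch below does that for the 2048-bit rows only, copied from the output above. The function name parse_pubkey_rates is hypothetical, and the regular expression assumes the line layout that this OpenSSL 0.9.7c build prints; other versions may format these lines differently.

```python
import re

# Matches lines such as: "rsa 2048 bits 0.0400s 0.0011s 25.0 945.4"
PUBKEY_LINE = re.compile(
    r"^(rsa|dsa)\s+(\d+)\s+bits\s+([\d.]+)s\s+([\d.]+)s\s+([\d.]+)\s+([\d.]+)$"
)


def parse_pubkey_rates(text):
    """Map 'rsa 2048' style keys to (sign/s, verify/s) from openssl speed output."""
    rates = {}
    for line in text.splitlines():
        match = PUBKEY_LINE.match(line.strip())
        if match:
            algorithm, bits, _, _, sign_rate, verify_rate = match.groups()
            rates[f"{algorithm} {bits}"] = (float(sign_rate), float(verify_rate))
    return rates


# 2048-bit rows copied from the three listings above.
listings = {
    "Pentium4": """
rsa 2048 bits 0.0400s 0.0011s 25.0 945.4
dsa 2048 bits 0.0103s 0.0122s 97.0 82.1
""",
    "Athlon64/i386": """
rsa 2048 bits 0.0304s 0.0008s 32.9 1180.0
dsa 2048 bits 0.0083s 0.0102s 120.8 98.5
""",
    "Athlon64/AMD64": """
rsa 2048 bits 0.0061s 0.0002s 162.6 4928.5
dsa 2048 bits 0.0017s 0.0020s 602.5 500.5
""",
}

parsed = {machine: parse_pubkey_rates(text) for machine, text in listings.items()}
baseline = parsed["Pentium4"]
for test in sorted(baseline):
    for machine, rates in parsed.items():
        sign_ratio = rates[test][0] / baseline[test][0]
        print(f"{test:9s} {machine:15s} {rates[test][0]:8.1f} sign/s "
              f"({sign_ratio:.2f}x Pentium4)")
```

Feeding it the complete run files instead of these excerpts would cover every key size in one pass.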

Update 04/29/04

According to one OpenSSL hacker, some ciphers are better than others for comparing hardware performance. The best cipher to compare the three runs is AES because it has no hand-optimized assembler code for any architecture and it benefits from the additional registers in architectures like AMD64, a major difference from i386 which is often overlooked. In the i386 code, the Athlon64 falls 40% behind the P4 purely due to clock speed, but when AMD64 is used it comes out about 5% ahead. This is a clear example of why 64-bit CPUs may provide performance improvements even for code which doesn't need 64-bit integers or pointers: AES, for example, consists only of 8-bit operations.
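Those percentages can be checked against the listings above. The Python sketch below takes the 8192-byte column of the three AES rows and expresses each Athlon64 result relative to the Pentium4; the helper name percent_vs_baseline is my own.

```python
# AES-CBC throughput in 1000s of bytes per second, 8192-byte blocks,
# copied from the three OpenSSL listings above.
aes_8192 = {
    "aes-128 cbc": {"Pentium4": 82485.71, "Athlon64/i386": 47655.72, "Athlon64/AMD64": 88133.09},
    "aes-192 cbc": {"Pentium4": 73213.11, "Athlon64/i386": 41372.50, "Athlon64/AMD64": 78074.84},
    "aes-256 cbc": {"Pentium4": 65603.78, "Athlon64/i386": 36553.03, "Athlon64/AMD64": 70070.49},
}


def percent_vs_baseline(value, baseline):
    """Signed percentage difference of value relative to baseline."""
    return (value / baseline - 1) * 100


for cipher, results in aes_8192.items():
    p4 = results["Pentium4"]
    i386 = percent_vs_baseline(results["Athlon64/i386"], p4)
    amd64 = percent_vs_baseline(results["Athlon64/AMD64"], p4)
    print(f"{cipher}: Athlon64/i386 {i386:+.1f}%, Athlon64/AMD64 {amd64:+.1f}% vs Pentium4")
```

At this block size the Athlon64 lands around 42 to 44 percent behind the Pentium4 in i386 mode and around 6 to 7 percent ahead in AMD64 mode, in line with the rough figures quoted above.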

Conclusions

This was a much larger project than it may seem by just reading this article, and if I have any regret it is that I didn't do more tests. If I had it to do over I'd try to include more real-world programs that I could measure performance with, perhaps md5 or some video encoders. On the other hand this review has achieved nearly 1200 lines of code and has taken several hours just in entering the data from the results files. In writing a review like this you have to take into account the attention span of the readership, and a longer review would not be beneficial to anyone. Still, I have plans for a second part with testing in X11 involving 3D rendering (hopefully) and other interesting real-world and stopwatch tests.

In the end I think the initial point is made with this review though, and that is that 64-bit does make a difference to the "average user" as well as the power user or administrator, but that performance advantage may not be evident in all situations. When under heavy load or dealing with large blocks of data, the Athlon64 (and we can assume that the Opteron and Athlon64-FX also apply) in 64-bit mode achieves superior performance to the same machine in IA32 (x86) mode. This is not so much because of the 64-bit addressing as it is the fact that there are twice as many general-purpose registers available (see my AMD64 ISA article for details).

Where you won't notice any difference is in compiling with GCC, which doesn't seem to give any preference to which processor or architecture it is compiling on with these systems. I have no experience with other compilers, so I can't guess as to how they would perform under these conditions (or if they would even work).

Another interesting conclusion that can be drawn is that 64-bit makes a bigger difference to performance than Hyper-Threading Technology does. While HT seems to offer better performance with multiple concurrent processes under light CPU loads, the AMD64 is built to achieve better performance under heavy loads.

I didn't write this to start a fight between AMD and Intel people; I myself am neither, although I've been accused of being an Intel fanboy once or twice (and despite that, I'm presently using the Athlon64 machine for my primary workstation). Rather the point of this article is to help people understand how these architectures are different in terms of performance so that they can decide for themselves how they might benefit from these technologies. The Pentium4 system, while it is not the "winner" in the real-world tests, is not a slow machine and it shouldn't be disregarded because of my findings. I don't use it because the temps are so high that it makes the fan too loud for me to tolerate, and I can't put the side panel on my machine without the risk of overheating. There are also problems with HT and the ATAng code -- it appears to be a locking problem, according to the mailing lists -- and it's not worth all of the system crashes. I'm not running my AMD machine in 64-bit mode though, because there are no acceptable word processors available for it yet, and Linux binary compatibility has not yet been implemented. This isn't to imply that it is unusable, but it certainly does limit desktop use for me.

This review should stand as both a basis for more testing (for myself and others) and as a measuring stick by which all future comparisons of this nature are judged. I've read so many bad reviews on the "big" sites that something like this needs to be done in as scientific a way as possible to discourage the further use of synthetic benchmarks and bareboard testing (which even I was at one time guilty of, with my 32-bit motherboard reviews; I humbly repent) to draw inaccurate conclusions about these technologies.

Copyright 2004 Jem Matzan.

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License.

Last Updated ( Jan 30, 2007 at 06:27 AM )