There have been many reviews and articles about the AMD64 architecture, but thus far there have been no performance reviews that show the advantage of 64-bit. Likewise there have been no comparisons between AMD's new 64-bit processors and Intel's Pentium4 Prescott core processors when taking full advantage of their technologies. Why do people benchmark the Athlon 64 and Opteron in i386 mode? Why do people benchmark the P4 processor with Hyper-Threading turned off? When people buy or build systems with these technologies, they expect to use them -- and they want to know how they compare to one another. Today all of that will change; I have the world's first 32 and 64-bit performance comparison between the Athlon64 3200+ and the Pentium4 3.2E.
Introduction/Disclaimer
To begin with, there are only a handful of solid conclusions that can be drawn from this benchmarking project, and those are listed at the end in the conclusions section. You must remember that the numbers printed in this review are only good for comparing performance in FreeBSD; other operating systems are almost guaranteed to have different ways of optimizing code and could suggest different conclusions about the raw performance of the hardware in question. So what I'm saying is, this information is true in FreeBSD 5.2.1-RELEASE with the stated hardware configurations. Any deviation from this configuration or software could reveal different results that sway superiority in a different direction, although my experience tells me that such vastly different data would be highly unusual. In general, you can probably assume that the data in this article will be indicative of performance on any operating system or with any similar hardware using identical technologies, but I can't guarantee that. Only further testing with different software will reveal the entire picture.
In this and several other benchmarking projects done with a variety of software in different operating systems I have never seen any performance advantage to turning off Hyper-Threading Technology in a Pentium4 processor. I don't know where the rumors of weaker performance with HT are coming from -- and I don't care (please don't email me with links to other people's reviews of synthetic tests in Windows) -- but you should give them little or no credence, or at very least be skeptical of them.
The Hardware
For this article I acquired the following hardware to use for two systems. They shared the same optical drive, hard drive, video card, RAM, cables, power supply, and chassis. Only the motherboard and CPU were changed to switch from the AMD machine to the Intel machine. This was done to prevent variations that could be caused by hardware manufacturing flaws or differences in output due to brand.
- Asus K8V Deluxe
- AMD Athlon64 3200+
- Thermaltake K8 Silent Boost HSF
- Intel D875PBZ (rev. 301)
- Intel Pentium4 3.2E
- Corsair PC3200 TwinX-LL 1024MB kit
- Western Digital 36GB Raptor SATA hard drive
- Sony DDU1621 DVD-ROM
- ATI Radeon 9800Pro All-In-Wonder 128MB
- Antec TrueBlue 480w power supply
- Skyhawk Galaxy case with front, rear, side, and cowl fans
The CPUs were sent to me by their respective manufacturers. The AMD processor is a standard OEM edition available to computer manufacturers and through some retail channels. The Intel processor is a "Confidential" series pre-release sample. I know I said I would never write a review of pre-release hardware, but that was in reference to motherboards, which can often perform better than their future retail box incarnation (for example the Asus P4S8X and all of the Intel E7205-based motherboard pre-releases were significantly better performers than the final release edition). My experience with pre-release CPUs (of which I have seen and tested four) is that they are always of the same technology as the final retail or OEM editions. It's one thing to make a super-powered motherboard for the press; it's entirely another thing to modify a CPU for the same nefarious purposes. To be certain, I asked my press contact at Intel if there have been any changes since the Confidential pre-release samples, and he said that they were technologically identical to the consumer release, with the exception of errata corrections, stepping numbers and the multiplier lock (meaning I can change the clock multiplier in the BIOS, a feature which is disabled in the consumer release to discourage overclocking). I trust Intel's word on the matter because my experience with them has indicated that they're not liars when it comes to issues like this. Just the same, the possibility exists that my results can be different than those obtained with a retail box P4 3.2E, but if that is the case the numbers would likely be in favor of the retail processor, not the pre-release sample. Again, we're talking about a possibility (in a random Universe, anything is possible) attached to a small likelihood of a significant performance difference.
The heatsink/fan (HSF) unit that I used for the Intel processor came from Intel. It's a modified version of the traditional socket478 fan, except it has a larger copper core than the previous edition and the fins on the heatsink are in a sort of star pattern. The locking mechanism is the same. The heatsink compound I used was already on the bottom of the heatsink in a small gray pad. Intel provided a syringe of extra compound, but I didn't have cause to use it.
The Thermaltake K8 Silentboost is an excellent solid-copper HSF. It's a good thing, too -- it was my only choice. It seems there aren't (or weren't when I bought this unit two months ago) many manufacturers that make HSFs for AMD64 processors. For the Athlon64 processor I used the standard-issue white heatsink compound, which is verified and certified by AMD.
The RAM was sent to me by Corsair for this and other benchmarking projects. It is the same retail box kit that you can buy through any authorized reseller. I could have requested RAM from a number of other manufacturers, but I chose Corsair for its high level of compatibility with motherboards, its low latency and reliable performance.
The WD Raptor is the fastest SATA drive on the market, and I acquired it for this and other benchmarking projects.
The Radeon 9800Pro AIW was sent to me by ATI for a previous review of SciTech's SNAP Graphics drivers. I chose it for this review because it is a reasonable choice for a high-end single CPU workstation, and because although I didn't do any graphics testing in this review, I plan on doing several graphic-intensive benchmarking projects in the future and I will be using this card for those tests. It helps to keep things as standard as possible to maintain cross-compatibility with my reviews.
The Antec TrueBlue 480 is both quiet and powerful. I had a long internal debate over whether I should get this power supply or the Vantec Stealth 420. Both are excellent supplies, but in the end I decided that the slightly cheaper Antec would be a better choice for this project because of its automatic fan control (the Vantec has a manual switch on the back) and its higher voltage and amperage ratings. The blue LED is superfluous -- you can hardly see it when the case covers are on.
The Skyhawk case was a poor choice, but it's all I had available to me. It is not FCC approved because of the acrylic window in the side, although I eliminated that variable by leaving the side cover off for all of my testing. This was also to improve ventilation and maintain a more consistent temperature inside the system. This case is totally unsuitable for a system based on the Prescott core because of its high operating temperature; despite all of the fans it has in it, ten seconds of idle operation with the side cover on forced the CPU fan to speeds of nearly 5000RPM.
Each system was assembled with care and all wires and connectors were correctly connected according to the manual. The BIOS was adjusted as necessary and the RAM sticks were in the proper slots for best performance. I decided to conduct my tests in a real computer chassis because, oddly, no other reviewers seem to do that. They bare-board everything, which means that they will never discover problems intrinsic to chassis assembly, such as the trouble I had with the Prescott system's fan noise. This benchmarking project was designed to closely mimic a real workstation system, not a fictional lab testing environment.
The Software
The operating system I used was FreeBSD 5.2.1-RELEASE. If you'd like to learn more about how I configured the operating system and how I devised my benchmarking methods, or if you'd like to learn how to benchmark hardware using FreeBSD, I've written a separate article about it here.
I used the standard Unix time command to conduct stopwatch tests, stream and ubench for synthetic tests, and OpenSSL, oggenc, and cdparanoia for my real-world tests. I did not conduct any testing in X -- that would be a totally separate review, and the research and testing for it have already begun.
I did not generate statistics because it's not easy to do with three test cases and I felt that they were unnecessary anyway. I also didn't make any graphics to show differences in performance. If you want to see pretty graphs that mislead readers and suggest flawed conclusions, you'll be disappointed with this review. You shouldn't need a graph or chart to put this data in perspective anyway -- it's pretty straightforward.
I tested everything with both the ULE and the 4BSD scheduler. Overall I found the 4BSD scheduler to be measurably faster in most cases. This is in contrast to the general belief that ULE is "faster" than 4BSD. To include a comparison of the schedulers would make this article a bit too long for most people's tastes, so I wrote a separate review for the scheduler comparison, and it can be found here. For the purpose of this article I chose to use the numbers from the 4BSD scheduler because they were generally faster for all three test cases and because the 4BSD scheduler is more mature and more likely to be used in a production environment (or even for regular desktop use).
Stopwatch Tests
All times are listed in seconds. Each number is the mean, over three distinct test iterations, of the real time (the total elapsed time), the user time (the time it takes to execute the utility), or the system overhead time.
It's simple: I timed how long it took to compile the base system with varying numbers of concurrent processes. I also compiled Apache version 2.0.48_3 using no concurrent processes. The latter is a much shorter test and the code is the same for both architectures, whereas there is some debate over whether it is fair to compare an AMD64 buildworld time and an i386 buildworld time because of the variation in the code. According to David O'Brien -- the FreeBSD developer in charge of the AMD64 architecture -- about 80% of the AMD64 instruction set is the i386 instruction set, and most of the code generation code is shared between the two, so if there's a difference it isn't a big one. But the code is different and thus it isn't fair to directly compare the times.
By themselves these compile times are not a reliable indicator of performance because of the dependence on the compiler (GCC, in this case), but when viewed as part of a benchmarking project these results have their place. After all, compile times are very important to a lot of people regardless of the reasons why the numbers are different. So keep in mind that we're benchmarking GCC (version 3.2.1) performance as well as system performance with these tests. The AMD64 code in GCC of course is not nearly as mature as the i386 code, so right off the bat the Athlon64/AMD64 times are going to be slower than they could be.
For the Apache2 build test I built the port and let it download and install all of the necessary dependencies. I then uninstalled Apache2 only -- leaving the dependencies in place and the downloaded source code in the distfiles directory -- and restarted in single-user mode, where Apache2 was rebuilt and timed. The time includes clean time; the exact command was time make install clean
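As a rough illustration of the stopwatch method described above, the following Python sketch runs a command three times and averages the real, user, and system times, much as time(1) reports them. The helper names `time_command` and `mean_times` are hypothetical, and the command shown is a harmless placeholder rather than the actual build; this is a sketch under those assumptions, not the script used for this review.

```python
import os
import subprocess
import sys
import time


def time_command(argv):
    """Run argv once and return (real, user, system) seconds for the child."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    real = time.perf_counter() - start
    after = os.times()
    # children_* fields accumulate CPU time of waited-for child processes.
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    return real, user, system


def mean_times(argv, runs=3):
    """Average each of real/user/system over `runs` separate iterations."""
    samples = [time_command(argv) for _ in range(runs)]
    return tuple(sum(column) / runs for column in zip(*samples))


# Placeholder command (assumption) so the sketch is safe to run anywhere.
# For the Apache2 test it would be ["make", "install", "clean"], run from
# the port directory.
command = [sys.executable, "-c", "pass"]
real, user, system = mean_times(command)
print(f"real {real:.2f}  user {user:.2f}  sys {system:.2f}")
```

Averaging each column separately keeps the real, user, and system figures comparable with the tables that follow.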
Buildworld Real Time

| Concurrent Processes | Pentium4 Time | Athlon64/i386 Time | Athlon64/AMD64 Time | Fastest? |
| --- | --- | --- | --- | --- |
| -j2 | 2346.29 | 2290.25 | 2132.81 | Athlon64/AMD64 |
| -j3 | 2096.52 | 2096.63 | 2019.88 | Athlon64/AMD64 |
| -j4 | 1999.00 | 2016.37 | 1965.46 | Athlon64/AMD64 |

The User and System times are not as important as the Real time; the Real time is the total time elapsed for the test, and it's really the only time that matters. But the other scores show some interesting results:
Buildworld User Time

| Concurrent Processes | Pentium4 Time | Athlon64/i386 Time | Athlon64/AMD64 Time | Fastest? |
| --- | --- | --- | --- | --- |
| -j2 | 2221.36 | 1420.85 | 1435.54 | Athlon64/i386 |
| -j3 | 2365.02 | 1427.53 | 1438.23 | Athlon64/i386 |
| -j4 | 2416.14 | 1427.68 | 1439.45 | Athlon64/i386 |

Buildworld System Time

| Concurrent Processes | Pentium4 Time | Athlon64/i386 Time | Athlon64/AMD64 Time | Fastest? |
| --- | --- | --- | --- | --- |
| -j2 | 408.69 | 246.61 | 372.80 | Athlon64/i386 |
| -j3 | 445.87 | 263.92 | 389.79 | Athlon64/i386 |
| -j4 | 465.16 | 272.46 | 396.17 | Athlon64/i386 |

The Pentium4 times appear to be impossible; according to the numbers for -j3 and -j4, it takes longer to execute the utility than it does to complete the entire process. This is due to Hyper-Threading -- with two virtual CPUs, the user and system times are accumulated across both of them, so together they can exceed the elapsed real time. Notice also that the Pentium4 isn't significantly faster than the Athlon64/i386 despite the advantage of Hyper-Threading. This could be due to the vastly different design (longer pipeline, more latency, smaller pathways, etc. -- see my article on Prescott technology for details) in the Prescott core.
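The effect can be checked directly from the published buildworld numbers by dividing the combined user and system time by the real time. The short Python sketch below does this; the function name `cpu_ratio` is hypothetical, and the data are copied from the three buildworld tables. A ratio above 1.0 means more CPU time was accounted than wall-clock time elapsed, which is only possible when more than one (virtual) CPU is counted.

```python
# (real, user, system) seconds, copied from the buildworld tables.
buildworld = {
    "Pentium4": {
        "-j2": (2346.29, 2221.36, 408.69),
        "-j3": (2096.52, 2365.02, 445.87),
        "-j4": (1999.00, 2416.14, 465.16),
    },
    "Athlon64/i386": {
        "-j2": (2290.25, 1420.85, 246.61),
        "-j3": (2096.63, 1427.53, 263.92),
        "-j4": (2016.37, 1427.68, 272.46),
    },
    "Athlon64/AMD64": {
        "-j2": (2132.81, 1435.54, 372.80),
        "-j3": (2019.88, 1438.23, 389.79),
        "-j4": (1965.46, 1439.45, 396.17),
    },
}


def cpu_ratio(real, user, system):
    """Accounted CPU time divided by elapsed wall-clock time."""
    return (user + system) / real


for machine, runs in buildworld.items():
    for jobs, times in runs.items():
        print(f"{machine:15s} {jobs}  (user+sys)/real = {cpu_ratio(*times):.2f}")
```

With these figures the Pentium4 ratio is above 1.0 at every -j level, while both Athlon64 configurations stay below 1.0.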
Apache 2 Real Time

| Pentium4 Time | Athlon64/i386 Time | Athlon64/AMD64 Time | Fastest? |
| --- | --- | --- | --- |
| 153.19 | 121.09 | 137.91 | Athlon64/i386 |

Apache 2 User Time

| Pentium4 Time | Athlon64/i386 Time | Athlon64/AMD64 Time | Fastest? |
| --- | --- | --- | --- |
| 82.38 | 73.04 | 78.93 | Athlon64/i386 |

Apache 2 System Time

| Pentium4 Time | Athlon64/i386 Time | Athlon64/AMD64 Time | Fastest? |
| --- | --- | --- | --- |
| 55.84 | 33.10 | 50.99 | Athlon64/i386 |

It's worth noting that there is a reproducible anomaly in the data for the Athlon64/i386 in which the ULE scheduler provides faster scores than the 4BSD scheduler in all cases. If I were to take the best scores between the two schedulers for each test, the Athlon64/i386 would be about two seconds faster than the Pentium4 in the buildworld -j4 test. Making that comparison between schedulers is not a fair test, however, because the test conditions for all machines would not be equal.
What you can't see with the averages I've put in above is that the Intel system has a larger degree of fluctuation in the times, whereas the AMD system in either mode produces far more consistent results. The times were particularly wacky (as much of a deviation as 8 seconds) as more concurrent processes were added to the buildworld test. It's my belief that the abnormally high operating temperature of the Pentium4 Prescott core (60 degrees C at idle) is to blame for its inconsistency, although there could also be some mistakes in the still-not-ready-for-primetime SMPng code that are providing suboptimal performance in a Hyper-Threaded system. The CPU temperature noticeably fluctuates under load when in a room temperature environment. Although I can't measure it while running tests, I can hear the CPU fan speed up over the length of the test. I would guess that the fan speed increases by roughly 1000 RPM or more, judging by the sound.
Synthetic Benchmarks
Synthetic tests can reveal information that we might not otherwise be able to obtain, but in general you should not put a lot of stock in them. If they were accurate I would think them more useful as a method of determining performance, but their numbers are not consistent with the other results from stopwatch and real-world testing. Part of the reason why I am printing these results is to show just how misleading a synthetic test can be -- take heed all ye synthetic benchmark monkeys.
Comparing synthetic benchmark numbers is only meaningful when comparing them with the same synthetic benchmarks with the same configuration. The memory bandwidth test is actually useful, but I'm sure it wasn't configured properly when I used it.
Stream measures memory bandwidth, although I don't trust it. I tried to contact the author of the benchmark to get his help in properly configuring this software, but after one email he abandoned me. The Stream website won't accept a connection from my machine (or are they blocking my ISP? Or is it IE-only? I'll never know), the Google cache didn't reveal the answers I needed and I had no access to reasonable documentation. The numbers below are the bandwidth rates in megabytes per second. I already know from testing these systems with SiSoft Sandra in 32-bit Windows that these numbers are not accurate (they seem to be exactly 1000 MB/sec too low), but they appear to be uniformly inaccurate as far as I can tell, so they are still useful as a point of comparison. Please note that I modified the Makefile according to the only directions on Stream that I could find. Please see the aforementioned FreeBSD benchmarking article for details.
Stream (MB/sec)

| | Pentium4 | Athlon64/i386 | Athlon64/AMD64 | Fastest? |
| --- | --- | --- | --- | --- |
| Copy | 3413.33 | 2048 | 2048 | Pentium4 |
| Scale | 3413.33 | 2048 | 2048 | Pentium4 |
| Add | 3072 | 2194.29 | 1920 | Pentium4 |
| Triad | 3072 | 1920 | 1920 | Pentium4 |

As far as I can tell from other testing, the Pentium4 should indeed have higher bandwidth because of its higher core clock speed and the dual-channel memory controller, although the Athlon64 should have lower latency because of the on-die memory controller. There was no need to average the results of all three test runs because the results were identical each time for all three systems.
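For readers unfamiliar with what a "Copy" bandwidth figure represents, here is a minimal Python sketch of the idea: copy a large buffer several times, take the best time, and count both the bytes read and the bytes written. This is not the Stream source code, the function name `copy_bandwidth_mb_s` and the buffer size are assumptions, and interpreter overhead means its output is not comparable with the numbers in the table above.

```python
import time


def copy_bandwidth_mb_s(n_bytes=16 * 1024 * 1024, repeats=5):
    """Estimate copy bandwidth in MB/sec (1 MB = 10**6 bytes)."""
    src = bytearray(n_bytes)
    dst = bytearray(n_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # one bulk copy of the whole buffer
        best = min(best, time.perf_counter() - start)
    # A copy reads n_bytes and writes n_bytes, so 2 * n_bytes are moved.
    return (2 * n_bytes) / best / 1e6


print(f"Copy: {copy_bandwidth_mb_s():.2f} MB/sec")
```

Using the best of several runs rather than the mean reduces the influence of one-off interruptions from other processes.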
The ubench test is an old Unix benchmark test, kind of the "old standby," like 3DMark in the Windows world. It is absolutely meaningless -- it produces a number that rates the CPU and RAM performance, and that number is only useful when comparing it to other ubench numbers.
It seems to be quite buggy, as I never once got it to complete its testing procedure. It would do the CPU test and then exit on a signal 6 (in 64-bit mode) or signal 11 (in i386 mode) when doing the memory test. Although it compiles for AMD64, it doesn't produce the kind of result that one would expect (based on the other tests in this project) in 64-bit mode.
Ubench CPU

| Pentium4 | Athlon64/i386 | Athlon64/AMD64 | Fastest? |
| --- | --- | --- | --- |
| 137239 | 97348 | 71352 | Pentium4 |

Now obviously that number for Athlon64/AMD64 is not accurate, but unfortunately you don't discover these kinds of problems until after the testing is over with. Still, maybe it'll help the ubench people fix their bugs.
Real-World Tests
This is the most useful of all of the data I collected because it shows how a system will perform in real-world scenarios. I didn't test a lot of different programs here because many of the tests that I think would be best must be performed in X11. I tried ripping a CD with cdparanoia, but the results were too close to say that there was a meaningful difference between systems. I attribute that to the CPUs being able to handle more work than the optical drive could provide for them. Just in case you're curious, the CD I tested with is LA Woman by The Doors, and it took roughly 660 seconds to rip it to the hard drive. From there I encoded the tracks with oggenc from the vorbis-tools port. The times below are, as above, listed in seconds and they represent mean averages from three separate testing runs. The exact command was time oggenc * and it was run in a directory containing only the ripped WAV files from the cdparanoia test.
Oggenc Real Time

| Pentium4 Time | Athlon64/i386 Time | Athlon64/AMD64 Time | Fastest? |
| --- | --- | --- | --- |
| 249.64 | 258.54 | 170.10 | Athlon64/AMD64 |

Again, the User and System times are not as important as the Real time listed above; the Real time is the total time elapsed for the test, and it's really the only time that matters.
Oggenc User Time

| Pentium4 Time | Athlon64/i386 Time | Athlon64/AMD64 Time | Fastest? |
| --- | --- | --- | --- |
| 247.87 | 256.46 | 168.35 | Athlon64/AMD64 |

Oggenc System Time

| Pentium4 Time | Athlon64/i386 Time | Athlon64/AMD64 Time | Fastest? |
| --- | --- | --- | --- |
| 1.05 | 0.94 | 0.67 | Athlon64/AMD64 |

There is quite a big difference between the 64-bit and 32-bit times, as you can see. It is abundantly clear that for encoding with oggenc, 64-bit makes a very significant difference in performance. Although I didn't test other encoders, I shall put forth the hypothesis that all audio and video encoding (and probably decoding) would be fastest (among the given machines in this review) with the Athlon64 in 64-bit mode.
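To put a number on that difference, the oggenc real times can be turned into a speedup factor and a percentage of time saved. The Python sketch below uses only the real times from the table above; the function names `speedup` and `time_saved_pct` are hypothetical.

```python
# Real (elapsed) oggenc times in seconds, from the Oggenc Real Time table.
oggenc_real = {
    "Pentium4": 249.64,
    "Athlon64/i386": 258.54,
    "Athlon64/AMD64": 170.10,
}


def speedup(baseline, candidate):
    """How many times faster the candidate finishes than the baseline."""
    return baseline / candidate


def time_saved_pct(baseline, candidate):
    """Percentage of the baseline's elapsed time that the candidate saves."""
    return (baseline - candidate) / baseline * 100


amd64 = oggenc_real["Athlon64/AMD64"]
for name in ("Athlon64/i386", "Pentium4"):
    base = oggenc_real[name]
    print(f"AMD64 vs {name}: {speedup(base, amd64):.2f}x, "
          f"{time_saved_pct(base, amd64):.1f}% less time")
```

On these figures the same Athlon64 finishes the encode roughly 1.5 times as fast in 64-bit mode as in i386 mode.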
Lastly I used OpenSSL from the FreeBSD base system as a test (click here for the OpenSSL documentation). The output was piped to a text file for each run. The exact command used was openssl speed >run1.txt replacing the number in the text file name to correspond with the number of the test run.
I selected the first run from each 4BSD testing group for publication. There were some variances, but I didn't notice a significant difference in the results and there was no way I was going to take the time to do averages on all of these numbers.
Pentium4 3.2E
OpenSSL 0.9.7c 30 Sep 2003
built on: Mon Mar 1 21:51:43 EST 2004
options:bn(64,32) md2(int) rc4(idx,int) des(ptr,risc1,16,long) aes(partial) blowfish(idx)
compiler: cc
available timing options: USE_TOD HZ=128 [sysconf value]
timing function used: getrusage
The 'numbers' are in 1000s of bytes per second processed.

| type | 16 bytes | 64 bytes | 256 bytes | 1024 bytes | 8192 bytes |
| --- | --- | --- | --- | --- | --- |
| md2 | 2016.84k | 4322.17k | 6070.05k | 6751.71k | 6975.67k |
| mdc2 | 5187.62k | 6309.10k | 6663.01k | 6842.96k | 6876.25k |
| md4 | 14902.30k | 56885.19k | 164041.24k | 360880.61k | 557337.75k |
| md5 | 13533.88k | 46715.49k | 127714.46k | 242926.90k | 329674.50k |
| hmac(md5) | 7378.81k | 27055.24k | 85550.47k | 197721.01k | 317374.48k |
| sha1 | 11164.95k | 30949.10k | 69059.28k | 101680.00k | 117127.47k |
| rmd160 | 9379.24k | 26992.07k | 53580.32k | 78348.85k | 89972.69k |
| rc4 | 131738.10k | 147298.92k | 149236.43k | 150199.64k | 150488.26k |
| des cbc | 29414.18k | 31502.57k | 33198.69k | 33733.32k | 34911.05k |
| des ede3 | 7834.13k | 8009.96k | 8034.31k | 7928.99k | 7964.91k |
| idea cbc | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| rc2 cbc | 23213.44k | 24324.99k | 24457.96k | 24481.34k | 24506.93k |
| rc5-32/12 cbc | 117025.06k | 136453.15k | 135906.90k | 136299.75k | 136531.75k |
| blowfish cbc | 63282.39k | 65804.80k | 66005.56k | 66382.67k | 66462.08k |
| cast cbc | 53759.70k | 58726.38k | 59652.19k | 60284.82k | 60165.25k |
| aes-128 cbc | 79154.59k | 78949.15k | 81544.36k | 81947.53k | 82485.71k |
| aes-192 cbc | 70636.01k | 70085.72k | 71894.33k | 72926.05k | 73213.11k |
| aes-256 cbc | 62936.16k | 63083.62k | 64930.03k | 65524.64k | 65603.78k |

| | | sign | verify | sign/s | verify/s |
| --- | --- | --- | --- | --- | --- |
| rsa | 512 bits | 0.0012s | 0.0001s | 819.1 | 8892.4 |
| rsa | 1024 bits | 0.0068s | 0.0003s | 147.2 | 3016.8 |
| rsa | 2048 bits | 0.0400s | 0.0011s | 25.0 | 945.4 |
| rsa | 4096 bits | 0.2531s | 0.0038s | 4.0 | 264.0 |

| | | sign | verify | sign/s | verify/s |
| --- | --- | --- | --- | --- | --- |
| dsa | 512 bits | 0.0011s | 0.0013s | 926.0 | 795.7 |
| dsa | 1024 bits | 0.0032s | 0.0038s | 311.3 | 264.5 |
| dsa | 2048 bits | 0.0103s | 0.0122s | 97.0 | 82.1 |

Athlon64/i386
OpenSSL 0.9.7c 30 Sep 2003
built on: Mon Mar 1 21:51:43 EST 2004
options:bn(64,32) md2(int) rc4(idx,int) des(ptr,risc1,16,long) aes(partial) blowfish(idx)
compiler: cc
available timing options: USE_TOD HZ=128 [sysconf value]
timing function used: getrusage
The 'numbers' are in 1000s of bytes per second processed.

| type | 16 bytes | 64 bytes | 256 bytes | 1024 bytes | 8192 bytes |
| --- | --- | --- | --- | --- | --- |
| md2 | 1638.38k | 3478.62k | 4841.21k | 5365.51k | 5537.52k |
| mdc2 | 4254.16k | 4813.89k | 4961.38k | 5002.23k | 5012.39k |
| md4 | 14880.30k | 51939.53k | 148904.64k | 278367.22k | 375362.23k |
| md5 | 11737.69k | 38977.26k | 103199.60k | 172939.25k | 215400.27k |
| hmac(md5) | 6663.99k | 23743.44k | 72134.23k | 146192.86k | 209540.45k |
| sha1 | 12623.98k | 37535.03k | 85321.96k | 125001.94k | 144802.48k |
| rmd160 | 9267.16k | 26074.68k | 54288.80k | 74601.53k | 83762.21k |
| rc4 | 104211.17k | 109806.94k | 111064.92k | 112103.17k | 111971.22k |
| des cbc | 21634.75k | 22290.56k | 22437.69k | 22491.36k | 22506.44k |
| des ede3 | 7844.08k | 7898.53k | 7924.55k | 7931.19k | 7932.89k |
| idea cbc | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| rc2 cbc | 16947.92k | 17566.40k | 17732.03k | 17761.42k | 17877.29k |
| rc5-32/12 cbc | 90145.55k | 105133.21k | 109166.02k | 110619.16k | 111020.08k |
| blowfish cbc | 53110.73k | 57511.04k | 58590.09k | 58910.42k | 59084.57k |
| cast cbc | 35109.22k | 37163.85k | 37530.39k | 37802.93k | 37861.76k |
| aes-128 cbc | 46546.88k | 46812.91k | 47449.30k | 47734.63k | 47655.72k |
| aes-192 cbc | 40101.23k | 40735.85k | 41216.97k | 41339.27k | 41372.50k |
| aes-256 cbc | 35840.54k | 36149.35k | 36431.92k | 36527.26k | 36553.03k |

| | | sign | verify | sign/s | verify/s |
| --- | --- | --- | --- | --- | --- |
| rsa | 512 bits | 0.0009s | 0.0001s | 1134.5 | 12481.6 |
| rsa | 1024 bits | 0.0048s | 0.0003s | 207.3 | 3978.1 |
| rsa | 2048 bits | 0.0304s | 0.0008s | 32.9 | 1180.0 |
| rsa | 4096 bits | 0.2016s | 0.0029s | 5.0 | 342.4 |

| | | sign | verify | sign/s | verify/s |
| --- | --- | --- | --- | --- | --- |
| dsa | 512 bits | 0.0008s | 0.0009s | 1277.2 | 1099.3 |
| dsa | 1024 bits | 0.0025s | 0.0030s | 395.7 | 330.4 |
| dsa | 2048 bits | 0.0083s | 0.0102s | 120.8 | 98.5 |

Athlon64/AMD64
OpenSSL 0.9.7c 30 Sep 2003
built on: Sun Mar 7 16:43:49 EST 2004
options:bn(64,64) md2(int) rc4(ptr,int) des(ptr,risc2,4,int) aes(partial) blowfish(idx)
compiler: cc
available timing options: USE_TOD HZ=128 [sysconf value]
timing function used: getrusage
The 'numbers' are in 1000s of bytes per second processed.

| type | 16 bytes | 64 bytes | 256 bytes | 1024 bytes | 8192 bytes |
| --- | --- | --- | --- | --- | --- |
| md2 | 1204.23k | 2571.24k | 3592.71k | 3988.41k | 4121.45k |
| mdc2 | 5161.87k | 6297.17k | 6657.93k | 6762.99k | 6798.87k |
| md4 | 11923.81k | 39554.35k | 111188.27k | 202296.83k | 266392.58k |
| md5 | 9601.32k | 29942.05k | 80814.26k | 139825.29k | 178806.72k |
| hmac(md5) | 5180.13k | 18441.17k | 56860.14k | 118549.58k | 173822.73k |
| sha1 | 10504.36k | 32097.17k | 82315.95k | 134920.10k | 166304.83k |
| rmd160 | 8125.53k | 23455.99k | 51355.48k | 73436.98k | 83672.80k |
| rc4 | 133123.48k | 137170.19k | 140310.94k | 141107.26k | 141285.73k |
| des cbc | 38653.14k | 39868.02k | 40403.14k | 40538.90k | 40579.26k |
| des ede3 | 15100.59k | 15479.58k | 15569.25k | 15591.39k | 15598.30k |
| idea cbc | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| rc2 cbc | 18303.26k | 18731.89k | 18895.34k | 18865.81k | 18888.15k |
| rc5-32/12 cbc | 95149.75k | 104157.93k | 106357.28k | 107218.10k | 107529.66k |
| blowfish cbc | 67697.27k | 72494.70k | 73492.28k | 73912.48k | 74057.74k |
| cast cbc | 49498.75k | 52099.27k | 52691.71k | 52933.53k | 53014.57k |
| aes-128 cbc | 83711.88k | 86110.78k | 87626.25k | 88013.83k | 88133.09k |
| aes-192 cbc | 75244.86k | 76571.56k | 77698.76k | 77985.74k | 78074.84k |
| aes-256 cbc | 67943.89k | 68785.95k | 69749.08k | 69994.72k | 70070.49k |

| | | sign | verify | sign/s | verify/s |
| --- | --- | --- | --- | --- | --- |
| rsa | 512 bits | 0.0003s | 0.0000s | 2955.1 | 31135.6 |
| rsa | 1024 bits | 0.0011s | 0.0001s | 913.8 | 13670.3 |
| rsa | 2048 bits | 0.0061s | 0.0002s | 162.6 | 4928.5 |
| rsa | 4096 bits | 0.0393s | 0.0006s | 25.5 | 1572.3 |

| | | sign | verify | sign/s | verify/s |
| --- | --- | --- | --- | --- | --- |
| dsa | 512 bits | 0.0002s | 0.0003s | 4664.3 | 3882.3 |
| dsa | 1024 bits | 0.0005s | 0.0007s | 1825.3 | 1500.3 |
| dsa | 2048 bits | 0.0017s | 0.0020s | 602.5 | 500.5 |

You might notice that the Pentium4 and the Athlon64/i386 tests are using the same build of OpenSSL. That's because I didn't reinstall or recompile the operating system when switching between the two systems, and they used the same /etc/make.conf file. Since both had all of the necessary functions to qualify them as an x86 P4 design and all of the proper driver modules would be detected just the same, I saw no reason to reinstall or rebuild the OS. This isn't Windows XP -- you don't have to reinstall every time you change arch-compatible hardware, and the kernel isn't solidified after the initial install.
This also brings up an important point: the make command does not yet have any special accommodation for SSE3 (Prescott New Instructions) or for the AMD64 ISA, so both the Pentium4 and the Athlon64/AMD64 numbers are lower than they could theoretically be with future compiler optimizations. The /etc/make.conf for the Athlon64/AMD64 used no CPU flags (that means none were put into make.conf, it does not mean that I used the NOCPUFLAGS options) because it's technically a different ISA. This was on the advice of David O'Brien of the FreeBSD team via the AMD64 mailing list.
OpenSSL in FreeBSD has hand-optimized assembler code for i386 and will therefore be more favorable to the Pentium4 in this test scenario when comparing certain ciphers (see the update heading below for details); the AMD64 code is all in C, so the code will perform differently. According to FreeBSD developers it is possible to optimize the code for AMD64 in the same way, but it would increase the clutter of the base system and it's a matter of debate whether or not it should be done.
I couldn't think of a good way to use a table to compare all of the data from the OpenSSL tests, so you'll just have to compare the numbers by looking back and forth. Or if you want a quicker answer, the Pentium4 is significantly faster than the Athlon64/AMD64 and slightly faster than the Athlon64/i386 in the smaller algorithms, but the Athlon64/AMD64 blows the other two out of the water in the RSA and DSA public-key tests, with the Athlon64/i386 in second and the Pentium4 in a distant third.
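One possible way to line the three runs up side by side is sketched below in Python. It takes a handful of rows from the OpenSSL output above (the 8192-byte column for the digests and ciphers, and the 2048-bit sign/s figures for RSA and DSA) and reports which machine is fastest on each. The selection of rows and the function name `fastest` are choices made for this illustration only; higher numbers are better in every row.

```python
MACHINES = ("Pentium4", "Athlon64/i386", "Athlon64/AMD64")

# 8192-byte throughput (1000s of bytes/sec) or 2048-bit signs per second,
# copied from the three openssl speed listings, in MACHINES order.
results = {
    "md5": (329674.50, 215400.27, 178806.72),
    "sha1": (117127.47, 144802.48, 166304.83),
    "rc4": (150488.26, 111971.22, 141285.73),
    "des cbc": (34911.05, 22506.44, 40579.26),
    "blowfish cbc": (66462.08, 59084.57, 74057.74),
    "rsa 2048 sign/s": (25.0, 32.9, 162.6),
    "dsa 2048 sign/s": (97.0, 120.8, 602.5),
}


def fastest(test_name):
    """Return the machine with the highest score for the given test."""
    scores = dict(zip(MACHINES, results[test_name]))
    return max(scores, key=scores.get)


print(f"{'test':18s}" + "".join(f"{m:>16s}" for m in MACHINES) + "  fastest")
for name, row in results.items():
    cells = "".join(f"{value:16.2f}" for value in row)
    print(f"{name:18s}{cells}  {fastest(name)}")
```

The same approach extends to every row of the listings if the full output files are available.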
Update 04/29/04
According to one OpenSSL hacker, some ciphers are better than others for comparing hardware performance. The best cipher to compare the three runs is AES because it has no hand-optimized assembler code for any architecture and it benefits from the additional registers in architectures like AMD64, a major difference from i386 which is often overlooked. In the i386 code, the Athlon64 falls 40% behind the P4 purely due to clock speed, but when AMD64 is used it comes out about 5% ahead. This is a clear example of why 64-bit CPUs may provide performance improvements even for code which doesn't need 64-bit integers or pointers: AES, for example, consists only of 8-bit operations.
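The AES comparison can be reproduced from the 8192-byte column of the three listings. The Python sketch below expresses each Athlon64 result as a percentage difference from the Pentium4; the function name `pct_vs_p4` is hypothetical. The computed figures land in the same neighborhood as the rounded ones quoted above.

```python
# 8192-byte AES CBC throughput (1000s of bytes/sec) from the listings above.
aes = {
    "aes-128 cbc": {"Pentium4": 82485.71, "Athlon64/i386": 47655.72,
                    "Athlon64/AMD64": 88133.09},
    "aes-192 cbc": {"Pentium4": 73213.11, "Athlon64/i386": 41372.50,
                    "Athlon64/AMD64": 78074.84},
    "aes-256 cbc": {"Pentium4": 65603.78, "Athlon64/i386": 36553.03,
                    "Athlon64/AMD64": 70070.49},
}


def pct_vs_p4(cipher, machine):
    """Percentage difference in throughput relative to the Pentium4."""
    row = aes[cipher]
    return (row[machine] / row["Pentium4"] - 1) * 100


for cipher in aes:
    print(f"{cipher}: i386 {pct_vs_p4(cipher, 'Athlon64/i386'):+.1f}%, "
          f"AMD64 {pct_vs_p4(cipher, 'Athlon64/AMD64'):+.1f}%")
```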
Conclusions
This was a much larger project than it may seem by just reading this article, and if I have any regret it is that I didn't do more tests. If I had it to do over I'd try to include more real-world programs that I could measure performance with, perhaps md5 or some video encoders. On the other hand this review has achieved nearly 1200 lines of code and has taken several hours just in entering the data from the results files. In writing a review like this you have to take into account the attention span of the readership, and a longer review would not be beneficial to anyone. Still, I have plans for a second part with testing in X11 involving 3D rendering (hopefully) and other interesting real-world and stopwatch tests.
In the end I think the initial point is made with this review though, and that is that 64-bit does make a difference to the "average user" as well as the power user or administrator, but that performance advantage may not be evident in all situations. When under heavy load or dealing with large blocks of data, the Athlon64 (and we can assume that the Opteron and Athlon64-FX also apply) in 64-bit mode achieves superior performance to the same machine in IA32 (x86) mode. This is not so much because of the 64-bit addressing as it is the fact that there are twice as many general-purpose registers available (see my AMD64 ISA article for details).
Where you won't notice any difference is in compiling with GCC, which doesn't seem to favor any particular processor or architecture among these systems. I have no experience with other compilers, so I can't guess as to how they would perform under these conditions (or if they would even work).
Another interesting conclusion that can be drawn is that 64-bit makes a bigger difference to performance than Hyper-Threading Technology does. While HT seems to offer better performance with multiple concurrent processes under light CPU loads, the AMD64 is built to achieve better performance under heavy loads.
I didn't write this to start a fight between AMD and Intel people; I myself am neither, although I've been accused of being an Intel fanboy once or twice (and despite that, I'm presently using the Athlon64 machine for my primary workstation). Rather the point of this article is to help people understand how these architectures are different in terms of performance so that they can decide for themselves how they might benefit from these technologies. The Pentium4 system, while it is not the "winner" in the real-world tests, is not a slow machine and it shouldn't be disregarded because of my findings. I don't use it because the temps are so high that it makes the fan too loud for me to tolerate, and I can't put the side panel on my machine without the risk of overheating. There are also problems with HT and the ATAng code -- it appears to be a locking problem, according to the mailing lists -- and it's not worth all of the system crashes. I'm not running my AMD machine in 64-bit mode though, because there are no acceptable word processors available for it yet, and Linux binary compatibility has not yet been implemented. This isn't to imply that it is unusable, but it certainly does limit desktop use for me.
This review should stand as both a basis for more testing (for myself and others) and as a measuring stick by which all future comparisons of this nature are judged. I've read so many bad reviews on the "big" sites that something like this needs to be done in as scientific a way as possible to discourage the further use of synthetic benchmarks and bareboard testing (which even I was at one time guilty of, with my 32-bit motherboard reviews; I humbly repent) to draw inaccurate conclusions about these technologies.
Copyright 2004 Jem Matzan.