Author Topic: Most powerful computer possible for a reasonable price  (Read 6055 times)
Dinkydau
Fractal Senior
******
Posts: 1285


1337


WWW
« Reply #30 on: March 24, 2012, 02:02:12 PM »

Yes, it can use quad-channel memory. I can still decide to make use of it.
Logged

brucedawson
Forums Newbie
*
Posts: 6


« Reply #31 on: March 25, 2012, 05:33:25 AM »

Some encryption algorithms might indeed be a good indicator for overall multi precision integer arithmetic performance.

I suspect that encryption algorithms are a fairly good indicator for deep-zoom fractal performance, but I don't know that for a fact. I do suspect that normal benchmarks are a poor indicator. The main requirement for fast deep-zoom fractal calculations is a fast 64x64 multiply, and fast adc instructions. I don't know how well Intel and AMD compare.

I agree that having other programs installed and running shouldn't matter, as long as they are well behaved. Task Manager can help you find those that are not.

Anyway, I wrote a benchmark program to help answer this question. It uses the same mov/mul/add/adc/adc block that FX uses so it should accurately predict deep-zoom performance in FX, and probably in other similarly optimized fractal programs.

http://ftp://ftp.cygnus-software.com/pub/InfprecPerf.exe

Aside: apparently the throughput of FX is not helped much by hyperthreads. I didn't know that.

Here are the results from my four-core eight-thread Sandybridge laptop. It nominally runs at 2.2 GHz but it can Turboboost up to between 2.8 and 3.2 GHz, according to CPU-Z. I think that means that performance is ultimately limited by the four-cycles-per-mul throughput, and that the hyperthreads merely help hit this maximum.

CPU information:
VendorID = 'GenuineIntel'
Stepping = 0x7, model = 0xA, family = 0x6, type = 0
Signature is 0x6A7 (06_0AH)
CPU's rdtsc speed is 2.195 GHz (peak speed may be higher).

Running tests on 8 thread system.
Performance with 1 threads is 0.981 GBlocks/sec
Performance with 2 threads is 1.988 GBlocks/sec
Performance with 3 threads is 2.355 GBlocks/sec
Performance with 4 threads is 2.994 GBlocks/sec
Performance with 5 threads is 3.547 GBlocks/sec
Performance with 6 threads is 3.681 GBlocks/sec
Performance with 7 threads is 3.672 GBlocks/sec
Performance with 8 threads is 3.816 GBlocks/sec

On my six-core twelve-thread Sandybridge work machine the results are:

CPU information:
VendorID = 'GenuineIntel'
Stepping = 0x6, model = 0xD, family = 0x6, type = 0
Signature is 0x6D6 (06_0DH)
CPU's rdtsc speed is 3.202 GHz (peak speed may be higher).

Running tests on 12 thread system.
Performance with 1 threads is 1.149 GBlocks/sec
Performance with 2 threads is 2.268 GBlocks/sec
Performance with 3 threads is 3.263 GBlocks/sec
Performance with 4 threads is 3.961 GBlocks/sec
Performance with 5 threads is 4.803 GBlocks/sec
Performance with 6 threads is 5.447 GBlocks/sec
Performance with 7 threads is 6.123 GBlocks/sec
Performance with 8 threads is 6.365 GBlocks/sec
Performance with 9 threads is 6.452 GBlocks/sec
Performance with 10 threads is 6.506 GBlocks/sec
Performance with 11 threads is 6.591 GBlocks/sec
Performance with 12 threads is 6.667 GBlocks/sec

The system process was consuming about 6% of CPU time, otherwise I would have hit about 7.0 GBlocks/sec.

If you have other results from different CPUs please send them to me. Note that this program is 64-bit only. As somebody observed, 32-bit performance is not really comparable. It will typically be 4-5x worse, so not very interesting.
Logged
stardust4ever
Conqueror
*******
Posts: 147



« Reply #32 on: March 25, 2012, 10:46:48 AM »

Hi Bruce! Very glad to see you have joined our community. Welcome to the forum!

Anyway, I wrote a benchmark program to help answer this question. It uses the same mov/mul/add/adc/adc block that FX uses so it should accurately predict deep-zoom performance in FX, and probably in other similarly optimized fractal programs.

http://ftp://ftp.cygnus-software.com/pub/InfprecPerf.exe
FYI, the link is not formatted properly. Use the ftp tag. I have fixed this for you.
ftp://ftp.cygnus-software.com/pub/InfprecPerf.exe

I ran the test 5 times. My peak score out of 5 trials was 6.009 GBlocks/sec. I am running an AMD FX-8150 8-core Bulldozer processor @ 4.2 GHz. Operating system is Windows 7 Pro 64-bit. The bus speed is ~200 MHz, with the multiplier set in BIOS to 21x, and with TurboCore and Advanced Power Management disabled, so the processor speed is pegged at 4.2 GHz (maximum rated turbo) whenever one or more cores are fully loaded. My RAM is 4 sticks of matching 2 GB DDR3 1333 MHz (8 GB total), dual channel, unganged.

Quote
CPU information:
VendorID = 'AuthenticAMD'
Stepping = 0x2, model = 0x1, family = 0xF, type = 0
Signature is 0xF12 (0F_01H)
CPU's rdtsc speed is 4.228 GHz (peak speed may be higher).

Running tests on 8 thread system.
Performance with 1 threads is 0.835 GBlocks/sec
Performance with 2 threads is 1.694 GBlocks/sec
Performance with 3 threads is 2.556 GBlocks/sec
Performance with 4 threads is 3.267 GBlocks/sec
Performance with 5 threads is 3.918 GBlocks/sec
Performance with 6 threads is 4.517 GBlocks/sec
Performance with 7 threads is 5.422 GBlocks/sec
Performance with 8 threads is 5.414 GBlocks/sec

Running tests on 8 thread system.
Performance with 1 threads is 0.853 GBlocks/sec
Performance with 2 threads is 1.669 GBlocks/sec
Performance with 3 threads is 2.518 GBlocks/sec
Performance with 4 threads is 3.343 GBlocks/sec
Performance with 5 threads is 3.968 GBlocks/sec
Performance with 6 threads is 4.721 GBlocks/sec
Performance with 7 threads is 4.956 GBlocks/sec
Performance with 8 threads is 5.952 GBlocks/sec

Running tests on 8 thread system.
Performance with 1 threads is 0.857 GBlocks/sec
Performance with 2 threads is 1.681 GBlocks/sec
Performance with 3 threads is 2.525 GBlocks/sec
Performance with 4 threads is 3.311 GBlocks/sec
Performance with 5 threads is 3.965 GBlocks/sec
Performance with 6 threads is 4.720 GBlocks/sec
Performance with 7 threads is 5.193 GBlocks/sec
Performance with 8 threads is 5.427 GBlocks/sec

Running tests on 8 thread system.
Performance with 1 threads is 0.845 GBlocks/sec
Performance with 2 threads is 1.691 GBlocks/sec
Performance with 3 threads is 2.522 GBlocks/sec
Performance with 4 threads is 3.328 GBlocks/sec
Performance with 5 threads is 3.947 GBlocks/sec
Performance with 6 threads is 4.498 GBlocks/sec
Performance with 7 threads is 5.335 GBlocks/sec
Performance with 8 threads is 5.936 GBlocks/sec

Running tests on 8 thread system.
Performance with 1 threads is 0.852 GBlocks/sec
Performance with 2 threads is 1.673 GBlocks/sec
Performance with 3 threads is 2.523 GBlocks/sec
Performance with 4 threads is 3.318 GBlocks/sec
Performance with 5 threads is 3.954 GBlocks/sec
Performance with 6 threads is 4.504 GBlocks/sec
Performance with 7 threads is 4.959 GBlocks/sec
Performance with 8 threads is 6.009 GBlocks/sec
It's possible that the benchmark speeds would fluctuate less if the tests ran for longer than the blink of an eye.

Bruce, one thing I am curious to know is whether thread stuffing, i.e. running more process threads than available CPU threads, would have an impact on performance. I remember playing around with various ray-tracing software back in the day when I used to be into 3D rendering (long before I got addicted to fractals). On my old single-thread Athlon XP processor (this was a really, really long time ago), when I ran two instances of the same program rendering the same image simultaneously, both renders would finish in about 20% less than twice the time it took one render to finish by itself, meaning the CPU was getting slightly more work done when it was overloaded. Suppose instead of 8 threads, you run 12, 16, or even 24 threads on an 8-threaded CPU. Would performance increase, decrease, or stay the same? I ask because I occasionally have multiple instances of Fractal Extreme running simultaneously on different projects, and I'm not sure what effect it has on overall throughput.
« Last Edit: March 25, 2012, 11:26:52 AM by stardust4ever » Logged
Dinkydau
Fractal Senior
******
Posts: 1285


1337


WWW
« Reply #33 on: March 25, 2012, 01:56:19 PM »

What is it that makes the performance of multiple threads worse in proportion to one thread? Is it the turbo mode, other processes not affecting the render, or do multiple threads just inherently mean less efficiency?

Apparently rendering multiple things at once increases performance; I didn't know that.

Here are some benchmark results on my Intel Q6700, 4×2.66 GHz, 4 threads:
Code:
Running tests on 4 thread system.
Performance with 1 threads is 0.661 GBlocks/sec
Performance with 2 threads is 1.317 GBlocks/sec
Performance with 3 threads is 1.969 GBlocks/sec
Performance with 4 threads is 2.598 GBlocks/sec
 
Running tests on 4 thread system.
Performance with 1 threads is 0.660 GBlocks/sec
Performance with 2 threads is 1.294 GBlocks/sec
Performance with 3 threads is 1.879 GBlocks/sec
Performance with 4 threads is 2.615 GBlocks/sec

This seems consistent with the benchmarks we ran using Fractal Extreme itself. The i7 2600 was 1.7 times faster than my Q6700, and the FX-8150 was 35% faster than the i7 2600. The predicted result for the FX-8150 would be:
2.615 * 1.7 * 1.35 = 6.001425
which corresponds almost exactly to the test by stardust4ever.
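
As a quick check of that estimate, here is a small Python sketch of the arithmetic. The two ratios and both GBlocks/sec figures are the ones quoted in this thread; the variable names are only illustrative.

```python
# Sanity check of the FX-8150 prediction, using figures quoted in this thread.
q6700_gblocks = 2.615        # best 4-thread Q6700 result posted above
i7_2600_vs_q6700 = 1.7       # ratio from the earlier Fractal Extreme renders
fx8150_vs_i7_2600 = 1.35     # "35% faster"
measured_fx8150 = 6.009      # stardust4ever's peak result in normal mode

predicted_fx8150 = q6700_gblocks * i7_2600_vs_q6700 * fx8150_vs_i7_2600
relative_error = abs(predicted_fx8150 - measured_fx8150) / measured_fx8150

print(f"predicted {predicted_fx8150:.3f} GBlocks/sec, "
      f"measured {measured_fx8150:.3f} GBlocks/sec, "
      f"difference {relative_error:.2%}")
```

The two ratios were rounded figures from separate renders, so agreement this close is partly luck.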

Bruce Dawson, which 12-thread CPU do you have, which model?
Logged

stardust4ever
Conqueror
*******
Posts: 147



« Reply #34 on: March 26, 2012, 05:05:25 AM »

I used AMD Overdrive to try temporarily bumping my processor up to 4.5 GHz, as many of the enthusiasts over at Overclockers.net have; my CPU temp went up to 74 °C and Windows 7 crashed with the "Blue Screen of Death" halfway through the Intel Burn Test, so I'm just going to leave it set at 4.2 GHz permanently. Upon restart, I was prompted by Windows to boot into Safe Mode, so I did. The benchmark test is a lot more stable in Safe Mode, and with zero Windows services running, I got slightly better results:
Quote
CPU information:
VendorID = 'AuthenticAMD'
Stepping = 0x2, model = 0x1, family = 0xF, type = 0
Signature is 0xF12 (0F_01H)
CPU's rdtsc speed is 4.228 GHz (peak speed may be higher).

Running tests on 8 thread system.
Performance with 1 threads is 0.864 GBlocks/sec
Performance with 2 threads is 1.694 GBlocks/sec
Performance with 3 threads is 2.592 GBlocks/sec
Performance with 4 threads is 3.311 GBlocks/sec
Performance with 5 threads is 4.123 GBlocks/sec
Performance with 6 threads is 4.758 GBlocks/sec
Performance with 7 threads is 5.445 GBlocks/sec
Performance with 8 threads is 6.123 GBlocks/sec

Running tests on 8 thread system.
Performance with 1 threads is 0.826 GBlocks/sec
Performance with 2 threads is 1.688 GBlocks/sec
Performance with 3 threads is 2.543 GBlocks/sec
Performance with 4 threads is 3.457 GBlocks/sec
Performance with 5 threads is 4.082 GBlocks/sec
Performance with 6 threads is 4.759 GBlocks/sec
Performance with 7 threads is 5.442 GBlocks/sec
Performance with 8 threads is 6.125 GBlocks/sec

Running tests on 8 thread system.
Performance with 1 threads is 0.864 GBlocks/sec
Performance with 2 threads is 1.651 GBlocks/sec
Performance with 3 threads is 2.516 GBlocks/sec
Performance with 4 threads is 3.247 GBlocks/sec
Performance with 5 threads is 4.077 GBlocks/sec
Performance with 6 threads is 4.792 GBlocks/sec
Performance with 7 threads is 5.450 GBlocks/sec
Performance with 8 threads is 6.120 GBlocks/sec

I believe the "Safe Mode" test better measures the true performance ceiling of my system at 6.120-6.125 GBlocks/sec (with my processor running at 4.2 GHz). Under Safe Mode, the bench didn't have any of the wild fluctuations, i.e. 5.3 < X < 6.1, that it had in normal mode. I suspect maybe the Windows Aero interface is to blame. I should probably disable Aero and just run Fractal Extreme with the Windows Classic look, but that's so 2000-ish, and my windows look so pretty in translucent purple!
Logged
brucedawson
Forums Newbie
*
Posts: 6


« Reply #35 on: March 26, 2012, 05:36:12 PM »

I ran the test 5 times. My peak score out of 5 trials was 6.009 GBlocks/sec. I am running an AMD FX-8150 8-core Bulldozer processor @ 4.2 GHz.
...
Bruce, one thing I am curious to know, is if thread stuffing, or running more process threads than available CPU threads, would have an impact on performance.

The memory subsystem should make no difference. The inner loop of FX's calculations fits entirely in the cache, so the memory subsystem and disks basically sit idle during calculations.

4.228 GHz / 0.857 GBlocks/sec on Bulldozer works out to one block processed every five cycles. That's pretty good, although Sandybridge seems to manage slightly better than that, since my work machine manages 1.15 GBlocks/sec at less than 4 GHz. Assuming that my laptop is TurboBoosting up to 3.2 GHz, its 0.981 GBlocks/sec works out to one block processed every 3.26 cycles. That's 50% faster on a clock-by-clock basis.
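
The cycles-per-block figures above are just clock rate divided by single-thread throughput. A minimal Python sketch, using the single-thread results posted earlier in the thread and assuming (as the post does) that the laptop really runs at its full 3.2 GHz turbo during the test:

```python
def cycles_per_block(clock_ghz, gblocks_per_sec):
    """Average clock cycles spent per mov/mul/add/adc/adc block on one thread."""
    return clock_ghz / gblocks_per_sec

# Single-thread figures posted in this thread.
bulldozer = cycles_per_block(4.228, 0.857)   # FX-8150 pegged at ~4.2 GHz
laptop = cycles_per_block(3.2, 0.981)        # i7-2720QM, ASSUMED to turbo to 3.2 GHz

print(f"Bulldozer:   {bulldozer:.2f} cycles/block")
print(f"Sandybridge: {laptop:.2f} cycles/block")
print(f"Per-clock advantage: {bulldozer / laptop:.2f}x")
```

If the laptop's real clock during the run was lower than 3.2 GHz, its cycles-per-block figure would be lower still.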

Adding more threads beyond num-hardware-threads shouldn't help. I could try it, but I doubt it would help.

I'm not entirely sure what makes the per-thread performance drop as #threads increases. There are three likely causes, which can interact:
1) On Sandybridge the second half of the 'cores' are just hyperthreads. They share the same execution resources with the first set of cores. Therefore they are bottlenecked by the raw execution resources. I think using both hyperthreads on a core might hide more instruction latency, which is why performance goes up a bit. Bulldozer threads are more independent, but not totally. The equivalent on Bulldozer is that the front-end is shared between two cores on the same module, so instruction decode can be a bottleneck.
2) On Sandybridge the clock speed should be reduced as load increases. On my laptop I would guess that the frequency drops as I go from one thread to four.
3) Even an 'idle' system has some housekeeping going on. When only one calculation thread is running this doesn't affect the results, but when FX tries to use all threads it starts to interfere a bit. If you don't have lots of extraneous programs running then this should normally be a modest effect.

It sounds like #3 is responsible for most of the slowdown on the Bulldozer, but it's curious that on the idle system the speed increase going from one thread to eight threads is about 7.1. I would have expected closer to 8. Maybe #1 is also a factor?
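
To put a number on that, here is a small Python sketch that computes the speedup and per-thread efficiency from a pair of results. The figures are from the first Safe Mode run on the FX-8150 posted above; the function name is only illustrative.

```python
def scaling(one_thread_gblocks, n_thread_gblocks, n_threads):
    """Return (speedup over one thread, fraction of ideal linear scaling)."""
    speedup = n_thread_gblocks / one_thread_gblocks
    return speedup, speedup / n_threads

# First Safe Mode run on the FX-8150: 0.864 GBlocks/sec with 1 thread,
# 6.123 GBlocks/sec with 8 threads.
speedup, efficiency = scaling(0.864, 6.123, 8)
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

The same function can be fed any of the other result tables in the thread to compare how quickly each CPU falls away from linear scaling.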

It's interesting that the six-core (twelve-thread) Sandybridge is slightly faster than the 8-core (four-module) Bulldozer. Sandybridge's lower cycle count per block is apparently enough to make up for Bulldozer's higher core count and higher clock rate.

My laptop is an Intel(R) Core(TM) i7-2720QM @ 2.20 GHz, peak is 3.2 GHz?
My work machine is an Intel(R) Core(TM) i7-3930K @ 3.20 GHz, peak is 3.5 GHz?

I haven't compared the costs, but I would guess that a reasonably fast Sandybridge with as many cores as possible is the way to go.
Logged
Dinkydau
Fractal Senior
******
Posts: 1285


1337


WWW
« Reply #36 on: March 26, 2012, 07:16:53 PM »

The "normal" i7 processors can't be used in a dual-CPU setup. Xeon processors can. Probably this is the closest equivalent:
http://www.newegg.com/Product/Product.aspx?Item=N82E16819117242
Maybe I should have bought 2 of those after all. I think that at 2.4 GHz the Opterons will be at least as fast; I will only be missing the high performance in other types of applications, for $50/CPU more. But who knows how expensive the motherboard would have been. I think the purchase of the Opterons wasn't that bad.
Logged

Dinkydau
Fractal Senior
******
Posts: 1285


1337


WWW
« Reply #37 on: March 26, 2012, 11:00:25 PM »

I just found out that the processor I linked to in the previous post can't be used in a dual-CPU setup. I should have known that, because I had read it before. I was kind of amazed to see an Intel CPU like that for that price at first, but it turns out AMD is really the only option at the moment to keep the price of a computer somewhat reasonable.

This is the true alternative:
http://www.newegg.com/Product/Product.aspx?Item=N82E16819117228

$540 against $1640
« Last Edit: March 26, 2012, 11:03:44 PM by Dinkydau » Logged

brucedawson
Forums Newbie
*
Posts: 6


« Reply #38 on: March 27, 2012, 05:18:05 AM »

I found that my work system was constantly using 6% of CPU due to some driver in a bad power state, so I rebooted and re-ran the tests. This increases the peak performance slightly, and suggests that the true minimum time for each block on Sandybridge might be 3 cycles. This isn't normally reached for some reason, but it's an interesting number.

Anyway, here are the results. As I said before, my work machine is an Intel(R) Core(TM) i7-3930K @ 3.20 GHz, peak is 3.5 GHz?

CPU information:
VendorID = 'GenuineIntel'
Stepping = 0x6, model = 0xD, family = 0x6, type = 0
Signature is 0x6D6 (06_0DH)
CPU's rdtsc speed is 3.202 GHz (peak speed may be higher).

Running tests on 12 thread system.
Performance with 1 threads is 1.150 GBlocks/sec
Performance with 2 threads is 2.277 GBlocks/sec
Performance with 3 threads is 3.445 GBlocks/sec
Performance with 4 threads is 4.373 GBlocks/sec
Performance with 5 threads is 5.038 GBlocks/sec
Performance with 6 threads is 5.686 GBlocks/sec
Performance with 7 threads is 5.902 GBlocks/sec
Performance with 8 threads is 5.960 GBlocks/sec
Performance with 9 threads is 6.085 GBlocks/sec
Performance with 10 threads is 6.259 GBlocks/sec
Performance with 11 threads is 6.573 GBlocks/sec
Performance with 12 threads is 7.014 GBlocks/sec
Logged
brucedawson
Forums Newbie
*
Posts: 6


« Reply #39 on: March 27, 2012, 05:21:02 AM »

Here are some benchmark results on my Intel Q6700, 4×2.66 GHz, 4 threads:

Can you post these again, with the CPU ID information? I'm collecting the results and it's good to have that information, in addition to the model information.
Logged
Dinkydau
Fractal Senior
******
Posts: 1285


1337


WWW
« Reply #40 on: March 27, 2012, 12:02:48 PM »

I've also sent the information to your e-mail address.

Code:
This program measures the performance of the Fractal eXtreme high-precision inner block. This block repeats a sequence of five 64-bit instructions (mov/mul/add/adc/adc) that take two 64-bit inputs and add their 128-bit product to a 192-bit accumulator. The speed of this operation determines the speed of deep zooms in Fractal eXtreme. Performance will vary depending on system load. Close other programs for maximum performance and consistency.
 
It takes 272 of these sequences to square a 1,024 bit number.
 
Similar inner loops are used in cryptography and other high-precision math.
 
 
 
VendorID = 'GenuineIntel'
Stepping = 0xB, model = 0xF, family = 0x6, type = 0
Signature is 0x6FB (06_0FH)
CPU's rdtsc speed is 2.673 GHz (peak speed may be higher).
 
Running tests on 4 thread system.
Performance with 1 threads is 0.661 GBlocks/sec
Performance with 2 threads is 1.317 GBlocks/sec
Performance with 3 threads is 1.969 GBlocks/sec
Performance with 4 threads is 2.598 GBlocks/sec
 
Running tests on 4 thread system.
Performance with 1 threads is 0.660 GBlocks/sec
Performance with 2 threads is 1.294 GBlocks/sec
Performance with 3 threads is 1.879 GBlocks/sec
Performance with 4 threads is 2.615 GBlocks/sec
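
For readers who want to see what the inner block in the program's banner text computes, here is a rough Python model of it. This is only a sketch of the arithmetic described there (a 64x64 multiply whose 128-bit product is added to a 192-bit accumulator held in three 64-bit words); it is not the actual Fractal eXtreme code, and the names `MASK64` and `mul_add_block` are made up for illustration.

```python
MASK64 = (1 << 64) - 1

def mul_add_block(acc, a, b):
    """Model one mul/add/adc/adc step.

    acc is a 192-bit accumulator as three 64-bit words (lo, mid, hi);
    a and b are 64-bit unsigned inputs. Returns the updated accumulator.
    """
    lo, mid, hi = acc
    product = a * b                            # mul: 64x64 -> 128-bit product
    p_lo, p_hi = product & MASK64, product >> 64

    total = lo + p_lo                          # add: low word, produces a carry
    lo, carry = total & MASK64, total >> 64

    total = mid + p_hi + carry                 # adc: middle word plus carry in
    mid, carry = total & MASK64, total >> 64

    hi = (hi + carry) & MASK64                 # adc: top word absorbs the last carry
    return lo, mid, hi

def to_int(acc):
    """Reassemble the three words into one Python integer, for checking."""
    lo, mid, hi = acc
    return lo | (mid << 64) | (hi << 128)

# Worst case for carries: a full low/mid accumulator plus the largest product.
start = (MASK64, MASK64, 0)
result = mul_add_block(start, MASK64, MASK64)
print(to_int(result) == to_int(start) + MASK64 * MASK64)
```

Python integers are arbitrary precision, so the masking and explicit carries are there only to mirror what the 64-bit registers and the carry flag do in hardware.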
Logged

brucedawson
Forums Newbie
*
Posts: 6


« Reply #41 on: March 29, 2012, 08:16:05 AM »

I've also sent the information to your e-mail address.

I summarized the information into a blog post which you can find here:

http://randomascii.wordpress.com/2012/03/28/fractal-and-crypto-performance/

TL;DR. Higher clock rate is better, more cores are better (Bulldozer cores count, Sandybridge hyperthreads don't), microarchitectural efficiencies make a big difference.

Let me know if you get any more results that I should add.
Logged
stardust4ever
Conqueror
*******
Posts: 147



« Reply #42 on: March 29, 2012, 10:44:30 AM »

I posted a comment on your blog post.
http://randomascii.wordpress.com/2012/03/28/fractal-and-crypto-performance/#comment-467

I have a strong suspicion that the older Phenom II architecture needs fewer clocks per block than the newer Bulldozer architecture, based on the fact that with my previous bench file, an 8-core Bulldozer processor running at 4.2 GHz took more than half as long to render (2:41) as a 4-core Phenom II running at 3.2 GHz (4:55).
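
A rough way to see this is to multiply cores, clock rate, and render time to get the total core-cycles each chip spent on the same render. The Python sketch below uses only the figures in the paragraph above; it assumes all cores were fully busy for the whole render and ignores everything else (turbo, shared front-ends, background load), so treat it as a back-of-the-envelope comparison.

```python
def core_gigacycles(cores, clock_ghz, minutes, seconds):
    """Total core-cycles (in billions) spent on a render, assuming all cores stay busy."""
    return cores * clock_ghz * (minutes * 60 + seconds)

bulldozer = core_gigacycles(8, 4.2, 2, 41)   # FX-8150, 2:41
phenom_ii = core_gigacycles(4, 3.2, 4, 55)   # Phenom II, 4:55

print(f"Bulldozer: {bulldozer:.0f} G core-cycles")
print(f"Phenom II: {phenom_ii:.0f} G core-cycles")
print(f"Bulldozer spent {bulldozer / phenom_ii:.2f}x as many core-cycles")
```

Under those assumptions the Bulldozer used noticeably more core-cycles for the same work, which is consistent with the suspicion above.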

Bruce, I also have two unrelated questions for you:

Do you think the new AVX instruction set included in the Bulldozer and Sandy Bridge processor architectures will allow faster calculations? I read that it can stack two 64-bit registers together to make 128 bits, or something like that? Will this allow future 128-bit integer calculation on 64-bit hardware, or am I dreaming? If so, that would be an awesome boon for deep-zoom fractal exploration.

Also, have you considered implementing the Karatsuba algorithm for long multiplication, if you haven't already? I believe it would provide significant speed improvement at precisions beyond 1024 bits...
http://en.wikipedia.org/wiki/Karatsuba_algorithm
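
For anyone unfamiliar with it, here is a minimal Python sketch of the Karatsuba idea: split each number in half and replace four half-size multiplies with three. This is a textbook illustration on non-negative Python integers, not how Fractal Extreme (or any tuned library) implements it; the `cutoff_bits` parameter is an arbitrary choice for the sketch.

```python
import random

def karatsuba(x, y, cutoff_bits=64):
    """Multiply two non-negative integers using Karatsuba's three-multiply split."""
    # Small operands: fall back to a plain multiply (one hardware-sized product).
    if x.bit_length() <= cutoff_bits or y.bit_length() <= cutoff_bits:
        return x * y

    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask

    z0 = karatsuba(x_lo, y_lo, cutoff_bits)                  # low * low
    z2 = karatsuba(x_hi, y_hi, cutoff_bits)                  # high * high
    # One extra multiply recovers both cross terms: (lo+hi)(lo+hi) - z0 - z2.
    z1 = karatsuba(x_lo + x_hi, y_lo + y_hi, cutoff_bits) - z0 - z2

    return (z2 << (2 * half)) + (z1 << half) + z0

# Check against Python's built-in multiplication on random 2048-bit inputs.
rng = random.Random(1234)
a = rng.getrandbits(2048)
b = rng.getrandbits(2048)
print(karatsuba(a, b) == a * b)
```

The saving in multiplies comes at the cost of extra additions and bookkeeping, which is why it usually only pays off above some precision threshold.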
« Last Edit: March 29, 2012, 11:04:10 AM by stardust4ever » Logged
Dinkydau
Fractal Senior
******
Posts: 1285


1337


WWW
« Reply #43 on: March 29, 2012, 03:03:49 PM »

AVX instructions sound like a very good idea. I've read a lot of good things about them. I think they could really make rendering a lot faster in the near or far future.
Logged

hobold
Fractal Phenom
******
Posts: 459


« Reply #44 on: March 29, 2012, 03:45:00 PM »

The currently existing AVX implementations (both Intel's and AMD's) can yield speedups for floating point computations, but the machines have other bottlenecks which make it tricky to realize the potential. Integer arithmetic, in contrast, is not really accelerated by AVX. We'll have to wait for AVX2 for that.
Logged