AMD Community › Devgurus Archives › Archives Discussions

maxreason

Journeyman III


07-15-2012
07:34 PM

confusion about f64 precision

I'm not sure this is the best forum for this question, so please move it if I guessed wrong. Since my previous related question got moved here, it is easier to continue here.

I just wrote some routines that you might call "vector functions". There will be a whole bunch of them eventually, but to start I wrote a function that computes the sine of the four f64 (double-precision) values in an f64vec4 variable, and another function that computes the cosine of the four f64 values in an f64vec4 variable.

I wrote these functions in assembly language, taking full advantage of the latest SIMD/AVX/FMA instructions the AMD bulldozer CPUs offer. You can see the assembly language for one of these two working functions in my previous message. I'm pretty sure that's the latest, greatest working code (edited only to keep the comments from wrapping and making the listing ugly and unreadable).

Here is my problem. I wrote 16 versions of the 4 * sine routine and 16 versions of the 4 * cosine routine. They are all based upon Chebyshev approximations, but with 7, 8, 9, 10, or 11 coefficients. I also wrote versions with fused multiply-add and versions with separate multiply and add. I also wrote versions that use vbroadcastsd and versions that load the four coefficient values in parallel. I tried just about every variation I could think of --- but still have a "precision problem".

Here is what I face. I call all these functions and get return values. I also call the C standard math library **sin()** and **cos()** functions. I also call my own **math_sin()** and **math_cos()** functions, written in assembly language, which just shove the values into the ancient 80x87 FPU and execute the FPU sin and cos instructions. I also call equivalent functions I wrote in C code. And I call **amd_sin()** and **amd_cos()** in the AMD math library. Here is what the results show over a range of inputs:

First, the **sin()**, **math_sin()** and **amd_sin()** functions always return the same values down to the final least-significant bit, and ditto for the **cos()**, **math_cos()** and **amd_cos()** functions. I don't recall ever seeing even a one-bit difference, and I print out the XOR of the 64-bit values (as 64-bit integers) as one measure of the difference between the results (because it makes it easy to judge the difference in bits).
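For reference, the XOR comparison boils down to a few lines of C (a minimal sketch; the helper name is made up):

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret two doubles as 64-bit integers and XOR them.
   For finite same-sign values, identical bits give 0, and the
   magnitude of the XOR roughly tracks how many low bits differ. */
static uint64_t bits_xor(double a, double b)
{
    uint64_t ua, ub;
    memcpy(&ua, &a, sizeof ua);   /* well-defined, unlike pointer punning */
    memcpy(&ub, &b, sizeof ub);
    return ua ^ ub;
}
```

This is the quantity printed in the 6th and 7th columns of the tables below.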

Within the range 0.0000 to PI/2, where these functions are supposed to work best, the results of my 16+ Chebyshev computations *sometimes* agree to the last bit. But for a fair percentage of input values the error varies between 1 bit and **7 to 8 bits**. And on rare occasions the difference is larger still:

```
input angle (radians) :   result of sin()     result of test fn  :   result of cos()     result of test fn  :  s^  c^ : (stst*stst)+(ctst*ctst)
0.010000000000000002  : 0.009999833334166666 0.009999833334166666 : 0.999950000416665263 0.999950000416665263 :   0   0 : 1.000000000000000000
0.010000000000000002  : 0.009999833334166666 0.009999833334166666 : 0.999950000416665263 0.999950000416665263 :   0   0 : 1.000000000000000000
0.010000000000000002  : 0.009999833334166666 0.009999833334166666 : 0.999950000416665263 0.999950000416665263 :   0   0 : 1.000000000000000000
0.010000000000000002  : 0.009999833334166666 0.009999833334166659 : 0.999950000416665263 0.999950000416663376 :   c  33 : 0.999999999999996225
0.010000000000000002  : 0.009999833334166666 0.009999833334166645 : 0.999950000416665263 0.999950000416665263 : ff4   0 : 1.000000000000000000
0.010000000000000002  : 0.009999833334166666 0.009999833334166675 : 0.999950000416665263 0.999950000416665263 :   5   0 : 1.000000000000000000
0.010000000000000002  : 0.009999833334166666 0.009999833334166675 : 0.999950000416665263 0.999950000416665263 :   5   0 : 1.000000000000000000
```

The 1st column is the input angle in radians. The 2nd column is the result of **sin()**, and the 3rd column is the result of the routine being tested. The 4th and 5th columns are the same as the 2nd and 3rd, except for cosine instead of sine. The 6th and 7th columns are the ones to watch: they contain the bitwise XOR of the bit patterns of the tested sine and cosine results against the results of **sin()** and **cos()** respectively... which are the results of the FPU instructions. The 8th column is simply the sum of the squares of the sine and cosine results, which should always equal 1.0000000000 if the sine and cosine are correct. The 9th column would be the function name for each row, but it didn't fit in this forum format, so I list the order below:

1. **sin() and cos() routines in the C standard math library**
2. **my old assembly-language routines that execute the FPU sin and cos instructions**
3. **amd_sin() and amd_cos() in the AMD math library (which presumably execute bulldozer-specific instructions on my FX8150)**
4. **my medium-precision routines written in C (about 9 coefficients)**
5. **my high-precision routines written in C (about 11 coefficients)**
6. **my hand-coded SIMD/AVX/FMA assembly language computing 4 results simultaneously - 1st result**
7. **my hand-coded SIMD/AVX/FMA assembly language computing 4 results simultaneously - 2nd result**
8. **my hand-coded SIMD/AVX/FMA assembly language computing 4 results simultaneously - 3rd result**
9. **my hand-coded SIMD/AVX/FMA assembly language computing 4 results simultaneously - 4th result**
10. **my hand-coded SIMD/AVX/FMA assembly language computing a 7-coefficient Chebyshev**
11. **my hand-coded SIMD/AVX/FMA assembly language computing an 8-coefficient Chebyshev**
12. **my hand-coded SIMD/AVX/FMA assembly language computing a 9-coefficient Chebyshev**
13. **my hand-coded SIMD/AVX/FMA assembly language computing a 10-coefficient Chebyshev**
14. **my hand-coded SIMD/AVX/FMA assembly language computing an 11-coefficient Chebyshev**
15. **my hand-coded SIMD/AVX/FMA assembly language computing an 11-coefficient Chebyshev with vbroadcastsd and without fused multiply-add**

Without including a whole slew of these printouts as testimony, I can say without any doubt that the **sin_c09()** and **cos_c09()** routines print out the smallest difference from **sin()** and **cos()** of any of my routines, and as a result they are what I adopted in my 4 * sin and 4 * cos routines. Just to start, this situation makes no sense --- on the surface, anyway --- because the 10- and 11-coefficient routines are supposed to be more precise than the 7-, 8- and 9-coefficient routines. Why would my 9-coefficient routine better match the "gold standard" FPU and AMD routines... unless by some accident that's also what they chose --- maybe even [almost] the exact same coefficients???

However, "what they chose" is not the same statement as "the correct answer". At the moment, I have no way to verify the AMD and Intel claims that their routines are correct to within 1 bit (or 2 bits). As a matter of *fact*, the C library functions, the 80x87 FPU instructions, and the AMD library functions always agree among themselves, as the next batch of results shows:

```
0.062000000000000048 : 0.061960286300408285 0.061960286300408285 : 0.998078615601782904 0.998078615601782904 :   0   0 : 1.000000000000000000
0.062000000000000048 : 0.061960286300408285 0.061960286300408285 : 0.998078615601782904 0.998078615601782904 :   0   0 : 1.000000000000000000
0.062000000000000048 : 0.061960286300408285 0.061960286300408285 : 0.998078615601782904 0.998078615601782904 :   0   0 : 1.000000000000000000
0.062000000000000048 : 0.061960286300408285 0.061960286300408285 : 0.998078615601782904 0.998078615601782904 :   0   0 : 1.000000000000000000
0.062000000000000048 : 0.061960286300408285 0.061960286300408285 : 0.998078615601782904 0.998078615601782904 :   0   0 : 1.000000000000000000
0.062000000000000048 : 0.061960286300408285 0.061960286300408285 : 0.998078615601782904 0.998078615601782904 :   0   0 : 1.000000000000000000
0.062000000000000048 : 0.061960286300408285 0.061960286300408285 : 0.998078615601782904 0.998078615601782904 :   0   0 : 1.000000000000000000
0.062000000000000048 : 0.061960286300408285 0.061960286300408285 : 0.998078615601782904 0.998078615601782904 :   0   0 : 1.000000000000000000
0.062000000000000048 : 0.061960286300408285 0.061960286300408285 : 0.998078615601782904 0.998078615601782904 :   0   0 : 1.000000000000000000
0.062000000000000048 : 0.061960286300408285 0.061960286300408250 : 0.998078615601782904 0.998078615601781349 : 3fb  12 : 0.999999999999997002
0.062000000000000048 : 0.061960286300408285 0.061960286300408285 : 0.998078615601782904 0.998078615601782904 :   0   0 : 1.000000000000000000
0.062000000000000048 : 0.061960286300408285 0.061960286300408160 : 0.998078615601782904 0.998078615601782904 : 3ee   0 : 1.000000000000000000
0.062000000000000048 : 0.061960286300408285 0.061960286300408340 : 0.998078615601782904 0.998078615601782904 :   8   0 : 1.000000000000000222
0.062000000000000048 : 0.061960286300408285 0.061960286300408340 : 0.998078615601782904 0.998078615601782904 :   8   0 : 1.000000000000000222
```

Here we see the 8-coefficient and 10-coefficient sine routines generate grossly different results, but modest or zero error cosine results. We also see the 11-coefficient sine routine generating horrific results. But the wimpy 7-coefficient sine and cosine are "dead on" to the final bit, as is my trusty 9-coefficient routine.

All I can say, as I scroll through a console window full of 100s of these outputs at different angles, is.... WTF.

Before you float your favorite theory in response, consider the following table of results:

```
0.102000000000000077 : 0.101823223983945582 0.101823223983945582 : 0.994802508570176047 0.994802508570176047 :   0   0 : 0.999999999999999889
0.102000000000000077 : 0.101823223983945582 0.101823223983945582 : 0.994802508570176047 0.994802508570176047 :   0   0 : 0.999999999999999889
0.102000000000000077 : 0.101823223983945582 0.101823223983945582 : 0.994802508570176047 0.994802508570176047 :   0   0 : 0.999999999999999889
0.102000000000000077 : 0.101823223983945582 0.101823223983945596 : 0.994802508570176047 0.994802508570176047 :   1   0 : 0.999999999999999889
0.102000000000000077 : 0.101823223983945582 0.101823223983945596 : 0.994802508570176047 0.994802508570176047 :   1   0 : 0.999999999999999889
0.102000000000000077 : 0.101823223983945582 0.101823223983945596 : 0.994802508570176047 0.994802508570176047 :   1   0 : 0.999999999999999889
0.102000000000000077 : 0.101823223983945582 0.101823223983945596 : 0.994802508570176047 0.994802508570176158 :   1  1f : 1.000000000000000222
0.102000000000000077 : 0.101823223983945582 0.101823223983945541 : 0.994802508570176047 0.994802508570175159 :   5   8 : 0.999999999999998113
0.102000000000000077 : 0.101823223983945582 0.101823223983945388 : 0.994802508570176047 0.994802508570176047 :   e   0 : 0.999999999999999889
0.102000000000000077 : 0.101823223983945582 0.101823223983945665 : 0.994802508570176047 0.994802508570176047 : 1fa   0 : 1.000000000000000000
0.102000000000000077 : 0.101823223983945582 0.101823223983945665 : 0.994802508570176047 0.994802508570176047 : 1fa   0 : 1.000000000000000000
```

I call your attention to the large errors listed for the sine in the 11-coefficient routines. My question is this. If the FPU and AMD sine and cosine routines are so perfect, then why does the sum of the squares AKA ((sine*sine) + (cosine*cosine)) come out perfect (exactly 1.0000) for the 10 and 11 coefficient values, but not for the "perfect" FPU and AMD values?

Like I said before, WTF.

Maybe I'm being too "harsh". Maybe I should not expect better. But I don't believe this. A difference of 9 freaking bits on a very non-extreme case (angle = 0.102 radians, which is roughly 6 degrees). And we have the simplest possible verification of these supposedly "wrong" values saying "correct", while that same verification test says "no cigar" to the "perfect" routines.

I am confused in several ways. Here is a sampling.

#1: Why do the AMD results computed in the SIMD/AVX registers **always** agree to the very last bit with the ancient FPU values?

#2: Why do the more precise (more-coefficient) versions of the Chebyshev computations generate less precise results than the 9-coefficient version (taking "agreement with the FPU and AMD" as our definition of precise)?

#3: Why is my 9-coefficient routine --- whether executed purely in C code, or executed with one input value in xmm registers, or executed with 4 input values simultaneously in ymm registers --- always closer to the FPU and AMD results?

#4: Why does my straightforward verification test claim the *supposedly-wrong-by-9-bits* results of the 11-coefficient routine are "exactly correct", and then turn around and claim the supposedly perfect FPU and AMD results are not?

I realize this is not a forum for "numerical analyst wizards". Frankly, if someone can point me to the appropriate forum, I'd appreciate that. But being a hardcore programmer and long time AMD fan, I thought I'd start here. Who knows, maybe the dudes who wrote the AMD math library watch this forum. Not likely, but who knows? Any math wizards out there?

But then there is the other possible issue --- the chance these problems aren't a math-wizard problem (!!!!! and I am certainly no math wizard!!!!!), but a problem with these "gold standard" trig routines. Unfortunately, I am definitely not enough of a math wizard to write routines that get higher precision in order to test my double-precision routines, the FPU routines, and the AMD math library routines.

I've beaten myself up quite a bit looking for answers. I've double, triple and quadruple checked all my constants against other sources.

Anyone else have any brilliant ideas?

2 Replies

drallan

Challenger


07-15-2012
11:35 PM

Re: confusion about f64 precision

Hi maxreason,

Agree, it doesn't make sense that your test produces the same number for different input! When physics kept saying the speed of light can't possibly be the same regardless of your motion, Einstein said "Um, maybe it's your watch". So, maybe it's your test. S*S + C*C produces two kinds of rounding error, both of which can contribute to the problem. Addition is particularly prone to error when adding numbers of different magnitude, as in your first case where you take the sin and cos of a number near 0.

A better test is probably what you already have --- the sum of absolute **XOR** differences; that's pretty precise.

As for a standard, you can use most any multi-precision calculator as a reference, and there are several C implementations out there that you can link into your work. The following is from the Windows calculator and shows how the rounding has affected the test for the 10-coefficient routine.

```
                                 AMD/Intel              xxx_c10()
sin                              0.009999833334166666   0.009999833334166645
cos                              0.999950000416665263   0.999950000416665263
(stest*stest) + (ctest*ctest)    1.000000000000000000   1.000000000000000000
'true' sin^2+cos^2 (truncated
 from 40-digit calculator)       0.999999999999999970   0.999999999999999970

full 40-digit calculator results for the sin*sin + cos*cos test:
AMD/Intel   0.99999999999999997046731085805641
xxx_c10()   0.99999999999999997004731785802142
```

As for the lower accuracy of the longer series: if they use the same series and coefficients, then maybe rounding error comes into play here too, i.e., if you already have an accurate answer, adding more terms can make it worse. You might need to control rounding in the SSE/AVX unit. In gcc you can use these builtins to read and write the SSE/AVX unit's rounding and denormal flags (the MXCSR register); I don't remember the flag values but you can look them up.

```c
#include <immintrin.h>

unsigned int oldflags = __builtin_ia32_stmxcsr();   /* read MXCSR (stmxcsr) */
unsigned int newflags = oldflags;                   /* set rounding/denormal bits as needed */
__builtin_ia32_ldmxcsr(newflags);                   /* write MXCSR (ldmxcsr) */
```
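The same thing can be written with the portable MXCSR intrinsics from `<xmmintrin.h>`, which compile to the same stmxcsr/ldmxcsr instructions (a sketch, assuming an x86 target; the wrapper and helper names are made up):

```c
#include <xmmintrin.h>

/* Report whether the SSE unit is currently in round-to-nearest mode
   (the IEEE default, and what the MXCSR reset state selects). */
static int rounding_mode_is_nearest(void)
{
    return _MM_GET_ROUNDING_MODE() == _MM_ROUND_NEAREST;
}

/* Save MXCSR, force round-to-nearest while fn runs, then restore
   the caller's rounding/denormal settings. */
static void run_with_round_to_nearest(void (*fn)(void))
{
    unsigned int saved = _mm_getcsr();          /* stmxcsr */
    _MM_SET_ROUNDING_MODE(_MM_ROUND_NEAREST);   /* ldmxcsr under the hood */
    fn();
    _mm_setcsr(saved);                          /* restore */
}
```

Wrapping the polynomial evaluation this way rules out a stray rounding-mode change as the source of the last-bit differences.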

Nice to see people still programming computers. You should try the GPUs if you haven't already. I just discovered that the 7970's hardware float sin instruction (4 clocks, not microcode) has near-full float accuracy. Earlier GPUs had very low accuracy, just enough for graphics. And it's four-quadrant.

maxreason

Journeyman III


07-16-2012
04:18 AM

Re: confusion about f64 precision

Yeah, that verification test probably isn't very good. But it was the only thing I could think of that didn't depend on knowing things I can't or don't know. I'm not sure whether an arbitrary-precision calculator can help me. I'd have to know whether their sin() and cos() functions are based upon coefficients like my routines are. If so, then their routines are only as good as their coefficients --- and how many digits are those? And how would we know? On the other hand, perhaps those arbitrary-precision calculators adopt the Taylor series technique, which is slower than Chebyshev but doesn't need any coefficients (the algorithm more-or-less builds the coefficients as it computes).

The problem with the XOR test is simply this: it presumes that the FPU sin() and cos(), and amd_sin() and amd_cos(), produce the correct answer. How do I know that, especially when supposedly very precise routines (the ones with 10 or 11 coefficients) disagree? The XOR test only tells me how much a result differs from the FPU and amd_xxx() functions. If I knew for certain they were correct, then the XOR test would be valid. But again, how do I know that, especially given the all-over-the-place answers I'm seeing?

Yeah, I write 3D graphics engines "on the side". My latest one is based upon OpenGL and GLSL, so I've done a fair bit of shader programming. I haven't done much OpenCL, just some quick fooling around and a promise to get back to it. The problem with GPUs for me is that double precision is barely, just barely sufficient for any work I do. Since the GPUs drop dead on double precision (because they have far too few double-precision compute units in them), I haven't gotten that interested in OpenCL programming. But believe me, if someone releases a GPU that has double-precision compute hardware for every core --- well, I'll be writing OpenCL until my brain explodes.

GPUs are really great. But I hope they inspire AMD to start *aggressively* pushing the #-of-cores envelope on their CPUs. I'd like to see a 64-core unit next year, a 256-core unit in 2014, a 1024-core unit in 2015, a 4096-core unit in 2016, and a 65536-core unit in 2017. You listening, AMD? It is a bit of a travesty to need to do general-purpose computing on freaking GPU cards!!! That should happen in the CPU, leaving the GPU for graphics. If they have to strip out everything that isn't 64 bits from the bulldozer to save transistors, then so be it. No point in 8-bit, 16-bit and 32-bit computing any more --- none whatsoever.