diff options
Diffstat (limited to 'gsl-1.9/doc/ieee754.texi')
-rw-r--r-- | gsl-1.9/doc/ieee754.texi | 461 |
1 files changed, 461 insertions, 0 deletions
diff --git a/gsl-1.9/doc/ieee754.texi b/gsl-1.9/doc/ieee754.texi new file mode 100644 index 0000000..4cfecda --- /dev/null +++ b/gsl-1.9/doc/ieee754.texi @@ -0,0 +1,461 @@ +@cindex IEEE floating point + +This chapter describes functions for examining the representation of +floating point numbers and controlling the floating point environment of +your program. The functions described in this chapter are declared in +the header file @file{gsl_ieee_utils.h}. + +@menu +* Representation of floating point numbers:: +* Setting up your IEEE environment:: +* IEEE References and Further Reading:: +@end menu + +@node Representation of floating point numbers +@section Representation of floating point numbers +@cindex IEEE format for floating point numbers +@cindex bias, IEEE format +@cindex exponent, IEEE format +@cindex sign bit, IEEE format +@cindex mantissa, IEEE format +The IEEE Standard for Binary Floating-Point Arithmetic defines binary +formats for single and double precision numbers. Each number is composed +of three parts: a @dfn{sign bit} (@math{s}), an @dfn{exponent} +(@math{E}) and a @dfn{fraction} (@math{f}). The numerical value of the +combination @math{(s,E,f)} is given by the following formula, +@tex +\beforedisplay +$$ +(-1)^s (1 \cdot fffff\dots) 2^E +$$ +\afterdisplay +@end tex +@ifinfo + +@example +(-1)^s (1.fffff...) 2^E +@end example + +@end ifinfo +@noindent +@cindex normalized form, IEEE format +@cindex denormalized form, IEEE format +The sign bit is either zero or one. The exponent ranges from a minimum value +@c{$E_{min}$} +@math{E_min} +to a maximum value +@c{$E_{max}$} +@math{E_max} depending on the precision. The exponent is converted to an +unsigned number +@math{e}, known as the @dfn{biased exponent}, for storage by adding a +@dfn{bias} parameter, +@c{$e = E + \hbox{\it bias}$} +@math{e = E + bias}. +The sequence @math{fffff...} represents the digits of the binary +fraction @math{f}. The binary digits are stored in @dfn{normalized +form}, by adjusting the exponent to give a leading digit of @math{1}. +Since the leading digit is always 1 for normalized numbers it is +assumed implicitly and does not have to be stored. +Numbers smaller than +@c{$2^{E_{min}}$} +@math{2^(E_min)} +are be stored in @dfn{denormalized form} with a leading zero, +@tex +\beforedisplay +$$ +(-1)^s (0 \cdot fffff\dots) 2^{E_{min}} +$$ +\afterdisplay +@end tex +@ifinfo + +@example +(-1)^s (0.fffff...) 2^(E_min) +@end example + +@end ifinfo +@noindent +@cindex zero, IEEE format +@cindex infinity, IEEE format +This allows gradual underflow down to +@c{$2^{E_{min} - p}$} +@math{2^(E_min - p)} for @math{p} bits of precision. +A zero is encoded with the special exponent of +@c{$2^{E_{min}-1}$} +@math{2^(E_min - 1)} and infinities with the exponent of +@c{$2^{E_{max}+1}$} +@math{2^(E_max + 1)}. + +@noindent +@cindex single precision, IEEE format +The format for single precision numbers uses 32 bits divided in the +following way, + +@smallexample +seeeeeeeefffffffffffffffffffffff + +s = sign bit, 1 bit +e = exponent, 8 bits (E_min=-126, E_max=127, bias=127) +f = fraction, 23 bits +@end smallexample + +@noindent +@cindex double precision, IEEE format +The format for double precision numbers uses 64 bits divided in the +following way, + +@smallexample +seeeeeeeeeeeffffffffffffffffffffffffffffffffffffffffffffffffffff + +s = sign bit, 1 bit +e = exponent, 11 bits (E_min=-1022, E_max=1023, bias=1023) +f = fraction, 52 bits +@end smallexample + +@noindent +It is often useful to be able to investigate the behavior of a +calculation at the bit-level and the library provides functions for +printing the IEEE representations in a human-readable form. + +@comment float vs double vs long double +@comment (how many digits are available for each) + +@deftypefun void gsl_ieee_fprintf_float (FILE * @var{stream}, const float * @var{x}) +@deftypefunx void gsl_ieee_fprintf_double (FILE * @var{stream}, const double * @var{x}) +These functions output a formatted version of the IEEE floating-point +number pointed to by @var{x} to the stream @var{stream}. A pointer is +used to pass the number indirectly, to avoid any undesired promotion +from @code{float} to @code{double}. The output takes one of the +following forms, + +@table @code +@item NaN +the Not-a-Number symbol + +@item Inf, -Inf +positive or negative infinity + +@item 1.fffff...*2^E, -1.fffff...*2^E +a normalized floating point number + +@item 0.fffff...*2^E, -0.fffff...*2^E +a denormalized floating point number + +@item 0, -0 +positive or negative zero + +@comment @item [non-standard IEEE float], [non-standard IEEE double] +@comment an unrecognized encoding +@end table + +The output can be used directly in GNU Emacs Calc mode by preceding it +with @code{2#} to indicate binary. +@end deftypefun + +@deftypefun void gsl_ieee_printf_float (const float * @var{x}) +@deftypefunx void gsl_ieee_printf_double (const double * @var{x}) +These functions output a formatted version of the IEEE floating-point +number pointed to by @var{x} to the stream @code{stdout}. +@end deftypefun + +@noindent +The following program demonstrates the use of the functions by printing +the single and double precision representations of the fraction +@math{1/3}. For comparison the representation of the value promoted from +single to double precision is also printed. + +@example +@verbatiminclude examples/ieee.c +@end example + +@noindent +The binary representation of @math{1/3} is @math{0.01010101... }. The +output below shows that the IEEE format normalizes this fraction to give +a leading digit of 1, + +@smallexample + f= 1.01010101010101010101011*2^-2 +fd= 1.0101010101010101010101100000000000000000000000000000*2^-2 + d= 1.0101010101010101010101010101010101010101010101010101*2^-2 +@end smallexample + +@noindent +The output also shows that a single-precision number is promoted to +double-precision by adding zeros in the binary representation. + +@comment importance of using 1.234L in long double calculations + +@comment @example +@comment int main (void) +@comment @{ +@comment long double x = 1.0, y = 1.0; + +@comment x = x + 0.2; +@comment y = y + 0.2L; + +@comment printf(" d %.20Lf\n",x); +@comment printf("ld %.20Lf\n",y); + +@comment return 1; +@comment @} + +@comment d 1.20000000000000001110 +@comment ld 1.20000000000000000004 +@comment @end example + + +@node Setting up your IEEE environment +@section Setting up your IEEE environment +@cindex IEEE exceptions +@cindex precision, IEEE arithmetic +@cindex rounding mode +@cindex arithmetic exceptions +@cindex exceptions, IEEE arithmetic +@cindex division by zero, IEEE exceptions +@cindex underflow, IEEE exceptions +@cindex overflow, IEEE exceptions +The IEEE standard defines several @dfn{modes} for controlling the +behavior of floating point operations. These modes specify the important +properties of computer arithmetic: the direction used for rounding (e.g. +whether numbers should be rounded up, down or to the nearest number), +the rounding precision and how the program should handle arithmetic +exceptions, such as division by zero. + +Many of these features can now be controlled via standard functions such +as @code{fpsetround}, which should be used whenever they are available. +Unfortunately in the past there has been no universal API for +controlling their behavior---each system has had its own low-level way +of accessing them. To help you write portable programs GSL allows you +to specify modes in a platform-independent way using the environment +variable @code{GSL_IEEE_MODE}. The library then takes care of all the +necessary machine-specific initializations for you when you call the +function @code{gsl_ieee_env_setup}. + +@deftypefun void gsl_ieee_env_setup () +This function reads the environment variable @code{GSL_IEEE_MODE} and +attempts to set up the corresponding specified IEEE modes. The +environment variable should be a list of keywords, separated by +commas, like this, + +@display +@code{GSL_IEEE_MODE} = "@var{keyword},@var{keyword},..." +@end display + +@noindent +where @var{keyword} is one of the following mode-names, + +@itemize @asis +@item +@code{single-precision} +@item +@code{double-precision} +@item +@code{extended-precision} +@item +@code{round-to-nearest} +@item +@code{round-down} +@item +@code{round-up} +@item +@code{round-to-zero} +@item +@code{mask-all} +@item +@code{mask-invalid} +@item +@code{mask-denormalized} +@item +@code{mask-division-by-zero} +@item +@code{mask-overflow} +@item +@code{mask-underflow} +@item +@code{trap-inexact} +@item +@code{trap-common} +@end itemize + +If @code{GSL_IEEE_MODE} is empty or undefined then the function returns +immediately and no attempt is made to change the system's IEEE +mode. When the modes from @code{GSL_IEEE_MODE} are turned on the +function prints a short message showing the new settings to remind you +that the results of the program will be affected. + +If the requested modes are not supported by the platform being used then +the function calls the error handler and returns an error code of +@code{GSL_EUNSUP}. + +When options are specified using this method, the resulting mode is +based on a default setting of the highest available precision (double +precision or extended precision, depending on the platform) in +round-to-nearest mode, with all exceptions enabled apart from the +@sc{inexact} exception. The @sc{inexact} exception is generated +whenever rounding occurs, so it must generally be disabled in typical +scientific calculations. All other floating-point exceptions are +enabled by default, including underflows and the use of denormalized +numbers, for safety. They can be disabled with the individual +@code{mask-} settings or together using @code{mask-all}. + +The following adjusted combination of modes is convenient for many +purposes, + +@example +GSL_IEEE_MODE="double-precision,"\ + "mask-underflow,"\ + "mask-denormalized" +@end example + +@noindent +This choice ignores any errors relating to small numbers (either +denormalized, or underflowing to zero) but traps overflows, division by +zero and invalid operations. +@end deftypefun + +@noindent +To demonstrate the effects of different rounding modes consider the +following program which computes @math{e}, the base of natural +logarithms, by summing a rapidly-decreasing series, +@tex +\beforedisplay +$$ +e = 1 + {1 \over 2!} + {1 \over 3!} + {1 \over 4!} + \dots + = 2.71828182846... +$$ +\afterdisplay +@end tex +@ifinfo + +@example +e = 1 + 1/2! + 1/3! + 1/4! + ... + = 2.71828182846... +@end example +@end ifinfo + +@example +@verbatiminclude examples/ieeeround.c +@end example + +@noindent +Here are the results of running the program in @code{round-to-nearest} +mode. This is the IEEE default so it isn't really necessary to specify +it here, + +@example +$ GSL_IEEE_MODE="round-to-nearest" ./a.out +i= 1 sum=1.000000000000000000 error=-1.71828 +i= 2 sum=2.000000000000000000 error=-0.718282 +.... +i=18 sum=2.718281828459045535 error=4.44089e-16 +i=19 sum=2.718281828459045535 error=4.44089e-16 +@end example + +@noindent +After nineteen terms the sum converges to within @c{$4 \times 10^{-16}$} +@math{4 \times 10^-16} of the correct value. +If we now change the rounding mode to +@code{round-down} the final result is less accurate, + +@example +$ GSL_IEEE_MODE="round-down" ./a.out +i= 1 sum=1.000000000000000000 error=-1.71828 +.... +i=19 sum=2.718281828459041094 error=-3.9968e-15 +@end example + +@noindent +The result is about +@c{$4 \times 10^{-15}$} +@math{4 \times 10^-15} +below the correct value, an order of magnitude worse than the result +obtained in the @code{round-to-nearest} mode. + +If we change to rounding mode to @code{round-up} then the series no +longer converges (the reason is that when we add each term to the sum +the final result is always rounded up. This is guaranteed to increase the sum +by at least one tick on each iteration). To avoid this problem we would +need to use a safer converge criterion, such as @code{while (fabs(sum - +oldsum) > epsilon)}, with a suitably chosen value of epsilon. + +Finally we can see the effect of computing the sum using +single-precision rounding, in the default @code{round-to-nearest} +mode. In this case the program thinks it is still using double precision +numbers but the CPU rounds the result of each floating point operation +to single-precision accuracy. This simulates the effect of writing the +program using single-precision @code{float} variables instead of +@code{double} variables. The iteration stops after about half the number +of iterations and the final result is much less accurate, + +@example +$ GSL_IEEE_MODE="single-precision" ./a.out +.... +i=12 sum=2.718281984329223633 error=1.5587e-07 +@end example + +@noindent +with an error of +@c{$O(10^{-7})$} +@math{O(10^-7)}, which corresponds to single +precision accuracy (about 1 part in @math{10^7}). Continuing the +iterations further does not decrease the error because all the +subsequent results are rounded to the same value. + +@node IEEE References and Further Reading +@section References and Further Reading + +The reference for the IEEE standard is, + +@itemize @asis +@item +ANSI/IEEE Std 754-1985, IEEE Standard for Binary Floating-Point Arithmetic. +@end itemize + +@noindent +A more pedagogical introduction to the standard can be found in the +following paper, + +@itemize @asis +@item +David Goldberg: What Every Computer Scientist Should Know About +Floating-Point Arithmetic. @cite{ACM Computing Surveys}, Vol.@: 23, No.@: 1 +(March 1991), pages 5--48. + +Corrigendum: @cite{ACM Computing Surveys}, Vol.@: 23, No.@: 3 (September +1991), page 413. and see also the sections by B. A. Wichmann and Charles +B. Dunham in Surveyor's Forum: ``What Every Computer Scientist Should +Know About Floating-Point Arithmetic''. @cite{ACM Computing Surveys}, +Vol.@: 24, No.@: 3 (September 1992), page 319. +@end itemize + +@noindent + +A detailed textbook on IEEE arithmetic and its practical use is +available from SIAM Press, + +@itemize @asis +@item +Michael L. Overton, @cite{Numerical Computing with IEEE Floating Point Arithmetic}, +SIAM Press, ISBN 0898715717. +@end itemize + +@noindent + +@comment to turn on math exception handling use __setfpucw, see +@comment /usr/include/i386/fpu_control.h +@comment e.g. +@comment #include <math.h> +@comment #include <stdio.h> +@comment #include <fpu_control.h> +@comment double f (double x); +@comment int main () +@comment { +@comment double a = 0; +@comment double y, z; +@comment __setfpucw(0x1372); +@comment mention extended vs double precision on Pentium, and how to get around +@comment it (-ffloat-store, or selective use of volatile) + +@comment On the alpha the option -mieee is needed with gcc +@comment In Digital's compiler the equivalent is -ieee, -ieee-with-no-inexact +@comment and -ieee-with-inexact, or -fpe1 or -fpe2 |