Diffstat (limited to 'gsl-1.9/doc/ieee754.texi')
-rw-r--r--   gsl-1.9/doc/ieee754.texi   461
1 files changed, 461 insertions, 0 deletions
diff --git a/gsl-1.9/doc/ieee754.texi b/gsl-1.9/doc/ieee754.texi
new file mode 100644
index 0000000..4cfecda
--- /dev/null
+++ b/gsl-1.9/doc/ieee754.texi
@@ -0,0 +1,461 @@
+@cindex IEEE floating point
+
+This chapter describes functions for examining the representation of
+floating point numbers and controlling the floating point environment of
+your program. The functions described in this chapter are declared in
+the header file @file{gsl_ieee_utils.h}.
+
+@menu
+* Representation of floating point numbers::
+* Setting up your IEEE environment::
+* IEEE References and Further Reading::
+@end menu
+
+@node Representation of floating point numbers
+@section Representation of floating point numbers
+@cindex IEEE format for floating point numbers
+@cindex bias, IEEE format
+@cindex exponent, IEEE format
+@cindex sign bit, IEEE format
+@cindex mantissa, IEEE format
+The IEEE Standard for Binary Floating-Point Arithmetic defines binary
+formats for single and double precision numbers. Each number is composed
+of three parts: a @dfn{sign bit} (@math{s}), an @dfn{exponent}
+(@math{E}) and a @dfn{fraction} (@math{f}). The numerical value of the
+combination @math{(s,E,f)} is given by the following formula,
+@tex
+\beforedisplay
+$$
+(-1)^s (1 \cdot fffff\dots) 2^E
+$$
+\afterdisplay
+@end tex
+@ifinfo
+
+@example
+(-1)^s (1.fffff...) 2^E
+@end example
+
+@end ifinfo
+@noindent
+@cindex normalized form, IEEE format
+@cindex denormalized form, IEEE format
+The sign bit is either zero or one. The exponent ranges from a minimum value
+@c{$E_{min}$}
+@math{E_min}
+to a maximum value
+@c{$E_{max}$}
+@math{E_max} depending on the precision. The exponent is converted to an
+unsigned number
+@math{e}, known as the @dfn{biased exponent}, for storage by adding a
+@dfn{bias} parameter,
+@c{$e = E + \hbox{\it bias}$}
+@math{e = E + bias}.
+The sequence @math{fffff...} represents the digits of the binary
+fraction @math{f}. The binary digits are stored in @dfn{normalized
+form}, by adjusting the exponent to give a leading digit of @math{1}.
+Since the leading digit is always 1 for normalized numbers, it is
+assumed implicitly and does not have to be stored.
+Numbers smaller than
+@c{$2^{E_{min}}$}
+@math{2^(E_min)}
+are stored in @dfn{denormalized form} with a leading zero,
+@tex
+\beforedisplay
+$$
+(-1)^s (0 \cdot fffff\dots) 2^{E_{min}}
+$$
+\afterdisplay
+@end tex
+@ifinfo
+
+@example
+(-1)^s (0.fffff...) 2^(E_min)
+@end example
+
+@end ifinfo
+@noindent
+@cindex zero, IEEE format
+@cindex infinity, IEEE format
+This allows gradual underflow down to
+@c{$2^{E_{min} - p}$}
+@math{2^(E_min - p)} for @math{p} bits of precision.
+A zero is encoded with the special exponent of
+@c{$E_{min}-1$}
+@math{E_min - 1} and infinities with the exponent of
+@c{$E_{max}+1$}
+@math{E_max + 1}.
+
+@noindent
+@cindex single precision, IEEE format
+The format for single precision numbers uses 32 bits divided in the
+following way,
+
+@smallexample
+seeeeeeeefffffffffffffffffffffff
+
+s = sign bit, 1 bit
+e = exponent, 8 bits (E_min=-126, E_max=127, bias=127)
+f = fraction, 23 bits
+@end smallexample
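+
+@noindent
+For example, the value @math{1.0} is
+@math{(-1)^0 (1.000...) 2^0}, so the sign bit is @math{s = 0}, the
+fraction bits are all zero and the stored exponent is
+@math{e = E + bias = 0 + 127 = 127}.  The corresponding single
+precision fields are,
+
+@smallexample
+s = 0    e = 01111111 (127)    f = 0000...0 (23 zero bits)
+@end smallexample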
+
+@noindent
+@cindex double precision, IEEE format
+The format for double precision numbers uses 64 bits divided in the
+following way,
+
+@smallexample
+seeeeeeeeeeeffffffffffffffffffffffffffffffffffffffffffffffffffff
+
+s = sign bit, 1 bit
+e = exponent, 11 bits (E_min=-1022, E_max=1023, bias=1023)
+f = fraction, 52 bits
+@end smallexample
+
+@noindent
+It is often useful to be able to investigate the behavior of a
+calculation at the bit-level and the library provides functions for
+printing the IEEE representations in a human-readable form.
+
+@comment float vs double vs long double
+@comment (how many digits are available for each)
+
+@deftypefun void gsl_ieee_fprintf_float (FILE * @var{stream}, const float * @var{x})
+@deftypefunx void gsl_ieee_fprintf_double (FILE * @var{stream}, const double * @var{x})
+These functions output a formatted version of the IEEE floating-point
+number pointed to by @var{x} to the stream @var{stream}. A pointer is
+used to pass the number indirectly, to avoid any undesired promotion
+from @code{float} to @code{double}. The output takes one of the
+following forms,
+
+@table @code
+@item NaN
+the Not-a-Number symbol
+
+@item Inf, -Inf
+positive or negative infinity
+
+@item 1.fffff...*2^E, -1.fffff...*2^E
+a normalized floating point number
+
+@item 0.fffff...*2^E, -0.fffff...*2^E
+a denormalized floating point number
+
+@item 0, -0
+positive or negative zero
+
+@comment @item [non-standard IEEE float], [non-standard IEEE double]
+@comment an unrecognized encoding
+@end table
+
+The output can be used directly in GNU Emacs Calc mode by preceding it
+with @code{2#} to indicate binary.
+@end deftypefun
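+
+@noindent
+As a minimal sketch of how these functions are called (the value
+@math{0.1} is chosen arbitrarily for illustration), the following
+program prints the single and double precision representations of
+@math{0.1},
+
+@example
+#include <stdio.h>
+#include <gsl/gsl_ieee_utils.h>
+
+int
+main (void)
+@{
+  float f = 0.1f;    /* single precision approximation to 1/10 */
+  double d = 0.1;    /* double precision approximation to 1/10 */
+
+  printf (" f=");
+  gsl_ieee_fprintf_float (stdout, &f);
+  printf ("\n");
+
+  printf (" d=");
+  gsl_ieee_fprintf_double (stdout, &d);
+  printf ("\n");
+
+  return 0;
+@}
+@end example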
+
+@deftypefun void gsl_ieee_printf_float (const float * @var{x})
+@deftypefunx void gsl_ieee_printf_double (const double * @var{x})
+These functions output a formatted version of the IEEE floating-point
+number pointed to by @var{x} to the stream @code{stdout}.
+@end deftypefun
+
+@noindent
+The following program demonstrates the use of the functions by printing
+the single and double precision representations of the fraction
+@math{1/3}. For comparison the representation of the value promoted from
+single to double precision is also printed.
+
+@example
+@verbatiminclude examples/ieee.c
+@end example
+
+@noindent
+The binary representation of @math{1/3} is @math{0.01010101... }. The
+output below shows that the IEEE format normalizes this fraction to give
+a leading digit of 1,
+
+@smallexample
+ f= 1.01010101010101010101011*2^-2
+fd= 1.0101010101010101010101100000000000000000000000000000*2^-2
+ d= 1.0101010101010101010101010101010101010101010101010101*2^-2
+@end smallexample
+
+@noindent
+The output also shows that a single-precision number is promoted to
+double-precision by adding zeros in the binary representation.
+
+@comment importance of using 1.234L in long double calculations
+
+@comment @example
+@comment int main (void)
+@comment @{
+@comment long double x = 1.0, y = 1.0;
+
+@comment x = x + 0.2;
+@comment y = y + 0.2L;
+
+@comment printf(" d %.20Lf\n",x);
+@comment printf("ld %.20Lf\n",y);
+
+@comment return 1;
+@comment @}
+
+@comment d 1.20000000000000001110
+@comment ld 1.20000000000000000004
+@comment @end example
+
+
+@node Setting up your IEEE environment
+@section Setting up your IEEE environment
+@cindex IEEE exceptions
+@cindex precision, IEEE arithmetic
+@cindex rounding mode
+@cindex arithmetic exceptions
+@cindex exceptions, IEEE arithmetic
+@cindex division by zero, IEEE exceptions
+@cindex underflow, IEEE exceptions
+@cindex overflow, IEEE exceptions
+The IEEE standard defines several @dfn{modes} for controlling the
+behavior of floating point operations. These modes specify the important
+properties of computer arithmetic: the direction used for rounding (e.g.
+whether numbers should be rounded up, down or to the nearest number),
+the rounding precision and how the program should handle arithmetic
+exceptions, such as division by zero.
+
+Many of these features can now be controlled via standard functions such
+as @code{fpsetround}, which should be used whenever they are available.
+Unfortunately in the past there has been no universal API for
+controlling their behavior---each system has had its own low-level way
+of accessing them. To help you write portable programs GSL allows you
+to specify modes in a platform-independent way using the environment
+variable @code{GSL_IEEE_MODE}. The library then takes care of all the
+necessary machine-specific initializations for you when you call the
+function @code{gsl_ieee_env_setup}.
+
+@deftypefun void gsl_ieee_env_setup ()
+This function reads the environment variable @code{GSL_IEEE_MODE} and
+attempts to set up the corresponding specified IEEE modes. The
+environment variable should be a list of keywords, separated by
+commas, like this,
+
+@display
+@code{GSL_IEEE_MODE} = "@var{keyword},@var{keyword},..."
+@end display
+
+@noindent
+where @var{keyword} is one of the following mode-names,
+
+@itemize @asis
+@item
+@code{single-precision}
+@item
+@code{double-precision}
+@item
+@code{extended-precision}
+@item
+@code{round-to-nearest}
+@item
+@code{round-down}
+@item
+@code{round-up}
+@item
+@code{round-to-zero}
+@item
+@code{mask-all}
+@item
+@code{mask-invalid}
+@item
+@code{mask-denormalized}
+@item
+@code{mask-division-by-zero}
+@item
+@code{mask-overflow}
+@item
+@code{mask-underflow}
+@item
+@code{trap-inexact}
+@item
+@code{trap-common}
+@end itemize
+
+If @code{GSL_IEEE_MODE} is empty or undefined then the function returns
+immediately and no attempt is made to change the system's IEEE
+mode. When the modes from @code{GSL_IEEE_MODE} are turned on the
+function prints a short message showing the new settings to remind you
+that the results of the program will be affected.
+
+If the requested modes are not supported by the platform being used then
+the function calls the error handler and returns an error code of
+@code{GSL_EUNSUP}.
+
+When options are specified using this method, the resulting mode is
+based on a default setting of the highest available precision (double
+precision or extended precision, depending on the platform) in
+round-to-nearest mode, with all exceptions enabled apart from the
+@sc{inexact} exception. The @sc{inexact} exception is generated
+whenever rounding occurs, so it must generally be disabled in typical
+scientific calculations. All other floating-point exceptions are
+enabled by default, including underflows and the use of denormalized
+numbers, for safety. They can be disabled with the individual
+@code{mask-} settings or together using @code{mask-all}.
+
+The following adjusted combination of modes is convenient for many
+purposes,
+
+@example
+GSL_IEEE_MODE="double-precision,"\
+ "mask-underflow,"\
+ "mask-denormalized"
+@end example
+
+@noindent
+This choice ignores any errors relating to small numbers (either
+denormalized, or underflowing to zero) but traps overflows, division by
+zero and invalid operations.
+@end deftypefun
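+
+@noindent
+A program typically calls @code{gsl_ieee_env_setup} once at the start
+of @code{main}, before any floating point computation whose behavior
+should be affected.  A minimal sketch of this usage is,
+
+@example
+#include <gsl/gsl_ieee_utils.h>
+
+int
+main (void)
+@{
+  /* read GSL_IEEE_MODE and apply the requested modes */
+  gsl_ieee_env_setup ();
+
+  /* ... floating point computations go here ... */
+
+  return 0;
+@}
+@end example
+
+@noindent
+The modes themselves are then selected at run time through the
+environment variable, as in the examples below.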
+
+@noindent
+To demonstrate the effects of different rounding modes consider the
+following program which computes @math{e}, the base of natural
+logarithms, by summing a rapidly-decreasing series,
+@tex
+\beforedisplay
+$$
+e = 1 + {1 \over 2!} + {1 \over 3!} + {1 \over 4!} + \dots
+ = 2.71828182846...
+$$
+\afterdisplay
+@end tex
+@ifinfo
+
+@example
+e = 1 + 1/2! + 1/3! + 1/4! + ...
+ = 2.71828182846...
+@end example
+@end ifinfo
+
+@example
+@verbatiminclude examples/ieeeround.c
+@end example
+
+@noindent
+Here are the results of running the program in @code{round-to-nearest}
+mode. This is the IEEE default so it isn't really necessary to specify
+it here,
+
+@example
+$ GSL_IEEE_MODE="round-to-nearest" ./a.out
+i= 1 sum=1.000000000000000000 error=-1.71828
+i= 2 sum=2.000000000000000000 error=-0.718282
+....
+i=18 sum=2.718281828459045535 error=4.44089e-16
+i=19 sum=2.718281828459045535 error=4.44089e-16
+@end example
+
+@noindent
+After nineteen terms the sum converges to within @c{$4 \times 10^{-16}$}
+@math{4 \times 10^-16} of the correct value.
+If we now change the rounding mode to
+@code{round-down} the final result is less accurate,
+
+@example
+$ GSL_IEEE_MODE="round-down" ./a.out
+i= 1 sum=1.000000000000000000 error=-1.71828
+....
+i=19 sum=2.718281828459041094 error=-3.9968e-15
+@end example
+
+@noindent
+The result is about
+@c{$4 \times 10^{-15}$}
+@math{4 \times 10^-15}
+below the correct value, an order of magnitude worse than the result
+obtained in the @code{round-to-nearest} mode.
+
+If we change the rounding mode to @code{round-up} then the series no
+longer converges (the reason is that when we add each term to the sum
+the result is always rounded up, which is guaranteed to increase the sum
+by at least one tick on each iteration). To avoid this problem we would
+need to use a safer convergence criterion, such as @code{while (fabs(sum -
+oldsum) > epsilon)}, with a suitably chosen value of @code{epsilon}.
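+
+A sketch of such a loop is shown below (the variable names and the
+tolerance @code{epsilon} are illustrative, and this is a simplified
+stand-in for the included example program),
+
+@example
+#include <stdio.h>
+#include <math.h>
+
+int
+main (void)
+@{
+  double sum = 0.0, oldsum = 0.0, term = 1.0;
+  double epsilon = 1e-12;    /* illustrative tolerance */
+  int i = 0;
+
+  do
+    @{
+      i++;
+      oldsum = sum;
+      sum += term;           /* add the term 1/(i-1)! */
+      term /= i;             /* next term, 1/i! */
+    @}
+  while (fabs (sum - oldsum) > epsilon);
+
+  printf ("e = %.18f after %d terms\n", sum, i);
+  return 0;
+@}
+@end example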
+
+Finally we can see the effect of computing the sum using
+single-precision rounding, in the default @code{round-to-nearest}
+mode. In this case the program thinks it is still using double precision
+numbers but the CPU rounds the result of each floating point operation
+to single-precision accuracy. This simulates the effect of writing the
+program using single-precision @code{float} variables instead of
+@code{double} variables. The iteration now stops after about half as many
+terms and the final result is much less accurate,
+
+@example
+$ GSL_IEEE_MODE="single-precision" ./a.out
+....
+i=12 sum=2.718281984329223633 error=1.5587e-07
+@end example
+
+@noindent
+with an error of
+@c{$O(10^{-7})$}
+@math{O(10^-7)}, which corresponds to single
+precision accuracy (about 1 part in @math{10^7}). Continuing the
+iterations further does not decrease the error because all the
+subsequent results are rounded to the same value.
+
+@node IEEE References and Further Reading
+@section References and Further Reading
+
+The reference for the IEEE standard is,
+
+@itemize @asis
+@item
+ANSI/IEEE Std 754-1985, IEEE Standard for Binary Floating-Point Arithmetic.
+@end itemize
+
+@noindent
+A more pedagogical introduction to the standard can be found in the
+following paper,
+
+@itemize @asis
+@item
+David Goldberg: What Every Computer Scientist Should Know About
+Floating-Point Arithmetic. @cite{ACM Computing Surveys}, Vol.@: 23, No.@: 1
+(March 1991), pages 5--48.
+
+Corrigendum: @cite{ACM Computing Surveys}, Vol.@: 23, No.@: 3 (September
+1991), page 413.  See also the sections by B. A. Wichmann and Charles
+B. Dunham in Surveyor's Forum: ``What Every Computer Scientist Should
+Know About Floating-Point Arithmetic''. @cite{ACM Computing Surveys},
+Vol.@: 24, No.@: 3 (September 1992), page 319.
+@end itemize
+
+@noindent
+
+A detailed textbook on IEEE arithmetic and its practical use is
+available from SIAM Press,
+
+@itemize @asis
+@item
+Michael L. Overton, @cite{Numerical Computing with IEEE Floating Point Arithmetic},
+SIAM Press, ISBN 0898715717.
+@end itemize
+
+@noindent
+
+@comment to turn on math exception handling use __setfpucw, see
+@comment /usr/include/i386/fpu_control.h
+@comment e.g.
+@comment #include <math.h>
+@comment #include <stdio.h>
+@comment #include <fpu_control.h>
+@comment double f (double x);
+@comment int main ()
+@comment {
+@comment double a = 0;
+@comment double y, z;
+@comment __setfpucw(0x1372);
+@comment mention extended vs double precision on Pentium, and how to get around
+@comment it (-ffloat-store, or selective use of volatile)
+
+@comment On the alpha the option -mieee is needed with gcc
+@comment In Digital's compiler the equivalent is -ieee, -ieee-with-no-inexact
+@comment and -ieee-with-inexact, or -fpe1 or -fpe2