Let's look at the binary representation of floating-point numbers.

In [49]:
function float32_as_binary(x::Float32)
    # Print the three IEEE 754 fields of a Float32:
    # 1 sign bit, 8 exponent bits, 23 mantissa (fraction) bits.
    bs = bitstring(x)
    println(" ", bs[1:1], "    ", bs[2:9], "    ", bs[10:32])
    println("sgn   exponent           mantissa")
end

function binary_as_float32(sgn::String, exponent::String, mantissa::String)
    # Reassemble a Float32 from its bit fields, given as strings of '0'/'1' characters.
    @assert(length(sgn) == 1)
    @assert(length(exponent) == 8)
    @assert(length(mantissa) == 23)
    return reinterpret(Float32, parse(UInt32, sgn * exponent * mantissa; base=2))
end
Out[49]:
binary_as_float32 (generic function with 1 method)
In [50]:
x = Float32(3.0)
Out[50]:
3.0f0

$3 = 11_b = 1.1_b \times 10_b^1 = 1.1_b \times 10_b^{128 - 127} = 1.1_b \times 10_b^{10000000_b - 127}$
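
Reading the fields back, we can also rebuild the value from its bits with the helper defined above (a quick check, not one of the original cells):

binary_as_float32("0", "10000000", "10000000000000000000000")   # 3.0f0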

In [94]:
float32_as_binary(x)
 0    10000000    10000000000000000000000
sgn   exponent           mantissa
In [86]:
float32_as_binary(Float32(3.5))
 0    10000000    11000000000000000000000
sgn   exponent           mantissa
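
Same idea: $3.5 = 11.1_b = 1.11_b \times 10_b^1 = 1.11_b \times 10_b^{10000000_b - 127}$, so the two leading mantissa bits above are $11$ (the leading $1$ before the binary point is implicit and not stored).
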
In [103]:
binary_as_float32("1","11111111","00000000000000000000110")
Out[103]:
NaN32
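
In the cell above the exponent field is all ones, which is how IEEE 754 encodes the special values: a zero mantissa gives Inf (with the sign bit choosing the sign), and any nonzero mantissa gives a NaN. A few more checks with the same helper (these particular bit patterns are mine, not from the original cells):

binary_as_float32("0", "11111111", "00000000000000000000000")   # Inf32
binary_as_float32("1", "11111111", "00000000000000000000000")   # -Inf32
binary_as_float32("0", "11111111", "00000000000000000000001")   # NaN32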
In [104]:
Float32(1.0) / Float32(0.0)
Out[104]:
Inf32
In [105]:
Float32(0.0) / Float32(0.0)
Out[105]:
NaN32
In [106]:
sqrt(Float32(-1.0))
DomainError with -1.0:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).

Stacktrace:
 [1] throw_complex_domainerror(::Symbol, ::Float32) at ./math.jl:33
 [2] sqrt(::Float32) at ./math.jl:557
 [3] top-level scope at In[106]:1
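
As the error message suggests, sqrt does return a complex result when handed a complex argument (a small check, not one of the original cells):

sqrt(Complex(Float32(-1.0)))   # 0.0f0 + 1.0f0im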
In [108]:
Float32(1.0) / Float32(0.0) - Float32(1.0) / Float32(0.0)
Out[108]:
NaN32
In [109]:
neg0 = binary_as_float32("1","00000000","00000000000000000000000")
Out[109]:
-0.0f0
In [112]:
Float32(1.0) / neg0
Out[112]:
-Inf32
In [113]:
-Float32(0.0)
Out[113]:
-0.0f0
In [114]:
Float32(1.0) / Float32(3.0)
Out[114]:
0.33333334f0
In [115]:
float32_as_binary(Float32(1.0) / Float32(3.0))
 0    01111101    01010101010101010101011
sgn   exponent           mantissa
In [117]:
abs(Float32(1.0) / Float32(3.0) - Float64(1.0) / Float64(3.0)) / (Float64(1/3))
Out[117]:
2.9802322443206464e-8

machine epsilon for Float32: $\approx 1.2 \times 10^{-7}$
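
That matches Julia's built-in value, and the relative error above is within the round-to-nearest bound of eps(Float32)/2 ≈ 6.0e-8 (a quick check, not one of the original cells):

eps(Float32)   # 1.1920929f-7 = 2^-23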

Why randomized rounding?

Randomized (stochastic) rounding often does a better job of preserving the mean of a large vector than deterministic round-to-nearest.
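
The reason: write $x = n + f$ with $n$ an integer and $0 \le f < 1$, and round by adding $U \sim \mathrm{Uniform}[0, 1)$ and taking the floor. Then $\mathbb{E}[\lfloor x + U \rfloor] = n(1 - f) + (n + 1)f = n + f = x$, so the rounding errors cancel on average instead of all pushing in the same direction.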

Let's take a dataset, quantize it to 8-bit integers, and measure how well the mean is preserved.

In [68]:
using Statistics
In [69]:
N = 1024;
X_original = 0.1 * randn(N) .+ 3.4;   # samples with mean 3.4 and standard deviation 0.1
In [70]:
mean(X_original)
Out[70]:
3.3978972384469635
In [77]:
X_8bit_nearest = Int8.(round.(X_original));   # deterministic round-to-nearest
In [78]:
mean(X_8bit_nearest)
Out[78]:
3.1552734375
In [80]:
X_8bit_randomized = Int8.(floor.(X_original .+ rand(N)));   # stochastic rounding: floor(x + U), U ~ Uniform[0, 1)
In [81]:
mean(X_8bit_randomized)
Out[81]:
3.408203125
In [82]:
abs(mean(X_8bit_randomized) - mean(X_original))
Out[82]:
0.010305886553036547
In [83]:
abs(mean(X_8bit_nearest) - mean(X_original))
Out[83]:
0.24262380094696345
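
A single draw of the random offsets could get lucky, so it is worth repeating the comparison. The sketch below (not part of the original notebook; the trial count is picked arbitrarily) reuses N and the Statistics import from above.

errs_nearest    = Float64[]
errs_randomized = Float64[]
for _ in 1:100
    X = 0.1 * randn(N) .+ 3.4
    push!(errs_nearest,    abs(mean(Int8.(round.(X))) - mean(X)))
    push!(errs_randomized, abs(mean(Int8.(floor.(X .+ rand(N)))) - mean(X)))
end
mean(errs_nearest), mean(errs_randomized)   # the randomized-rounding error is typically far smaller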