## Let's look at the binary representation of floating-point numbers

In [1]:
function float32_as_binary(x::Float32)
    # Split the 32-bit pattern into its IEEE 754 fields:
    # 1 sign bit, 8 exponent bits, 23 mantissa bits.
    bs = bitstring(x)
    println(" ", bs[1:1], "    ", bs[2:9], "    ", bs[10:32])
    println("sgn   exponent           mantissa")
end

function binary_as_float32(sgn::String, exponent::String, mantissa::String)
    # Concatenate the three bit fields, parse them as a UInt32,
    # and reinterpret the raw bits as a Float32.
    @assert length(sgn) == 1
    @assert length(exponent) == 8
    @assert length(mantissa) == 23
    return reinterpret(Float32, parse(UInt32, sgn * exponent * mantissa; base=2))
end

Out[1]:
binary_as_float32 (generic function with 1 method)
In [2]:
x = Float32(3.0)

Out[2]:
3.0f0

$3 = 11_b = 1.1_b \times 10_b^1 = 1.1_b \times 10_b^{128 - 127} = 1.1_b \times 10_b^{10000000_b - 127}$

(sign) (exponent) (mantissa)
   0    10000000   10000000000000000000000

In [3]:
float32_as_binary(x)

 0    10000000    10000000000000000000000
sgn   exponent           mantissa
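
Going the other direction: reassembling those three fields and reinterpreting the bits should recover `3.0f0`. This mirrors what `binary_as_float32` above does, written out inline:

```julia
# Concatenate the sign, exponent, and mantissa fields of 3.0f0, parse the
# 32-character bit string as a UInt32, then reinterpret the raw bits.
bits = parse(UInt32, "0" * "10000000" * "10000000000000000000000"; base=2)
reinterpret(Float32, bits)   # 3.0f0
```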

In [4]:
float32_as_binary(Float32(3.5))

 0    10000000    11000000000000000000000
sgn   exponent           mantissa

In [18]:
binary_as_float32("0","00000000","00000000000000000000001")

Out[18]:
1.0f-45
In [22]:
reinterpret(Float64, UInt64(1))

Out[22]:
5.0e-324
In [23]:
Float32(1.0) / Float32(0.0)

Out[23]:
Inf32
In [24]:
Float32(0.0) / Float32(0.0)

Out[24]:
NaN32
In [27]:
sqrt(Float32(-1.0))

DomainError with -1.0:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).

Stacktrace:
[1] throw_complex_domainerror(::Symbol, ::Float32) at ./math.jl:33
[2] sqrt(::Float32) at ./math.jl:557
[3] top-level scope at In[27]:1
In [28]:
Float32(1.0) / Float32(0.0) - Float32(1.0) / Float32(0.0)

Out[28]:
NaN32
In [29]:
neg0 = binary_as_float32("1","00000000","00000000000000000000000")

Out[29]:
-0.0f0
In [37]:
Float32(1.0) / neg0

Out[37]:
-Inf32
In [38]:
-Float32(0.0)

Out[38]:
-0.0f0

If $x = y$, does that mean that $f(x) = f(y)$?

In [31]:
x = -Float32(0.0);
y = Float32(0.0);
x == y

Out[31]:
true
In [36]:
x / x == x / x

Out[36]:
false
In [39]:
Float32(1.0) / Float32(3.0)

Out[39]:
0.33333334f0
In [42]:
float32_as_binary(0.33333334f0)

 0    01111101    01010101010101010101011
sgn   exponent           mantissa

In [40]:
float32_as_binary(Float32(1.0) / Float32(3.0))

 0    01111101    01010101010101010101011
sgn   exponent           mantissa

In [43]:
Float64(Float32(1.0) / Float32(3.0))

Out[43]:
0.3333333432674408
In [45]:
0.1f0

Out[45]:
0.1f0
In [46]:
Float64(0.1f0)

Out[46]:
0.10000000149011612
In [48]:
abs(Float32(1.0) / Float32(3.0) - Float64(1.0) / Float64(3.0)) / (Float64(1/3))

Out[48]:
2.9802322443206464e-8

machine epsilon for `Float32`: $\epsilon \approx 1.2 \times 10^{-7}$
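
Julia exposes machine epsilon directly via `eps`. The relative error measured above ($\approx 3.0 \times 10^{-8}$) sits below `eps(Float32)/2`, the worst-case relative error of round-to-nearest:

```julia
# eps(Float32) is the gap between 1.0f0 and the next representable Float32: 2^-23.
eps(Float32) == 2.0f0^-23          # true

# Round-to-nearest guarantees relative error at most eps/2.
2.9802322443206464e-8 <= eps(Float32) / 2   # true
```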


## Why randomized rounding?

Randomized (stochastic) rounding often preserves the mean of a large vector better than round-to-nearest.

Suppose we have a dataset: we'll quantize it to 8-bit integers, then measure the mean of the result.
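
The trick `floor.(X .+ rand(N))` used below rounds each entry up with probability equal to its fractional part, so the rounded value is unbiased: its expectation is exactly the input. A quick Monte Carlo sanity check (the helper name `stochastic_round` and the sample count are our own choices):

```julia
using Statistics, Random

# Round up with probability frac(x), down otherwise; E[stochastic_round(x)] == x.
stochastic_round(x) = floor(x + rand())

Random.seed!(1)   # fixed seed for reproducibility
m = mean(stochastic_round(3.4) for _ in 1:100_000)
# m lands near 3.4, while round-to-nearest would always return 3.0 here.
```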

In [ ]:
using Statistics

In [ ]:
N = 1024;
X_original = 0.1 * randn(N) .+ 3.4;

In [ ]:
mean(X_original)

In [ ]:
X_8bit_nearest = Int8.(round.(X_original));

In [ ]:
mean(X_8bit_nearest)

In [ ]:
X_8bit_randomized = Int8.(floor.(X_original .+ rand(N)));

In [ ]:
mean(X_8bit_randomized)

In [ ]:
abs(mean(X_8bit_randomized) - mean(X_original))

In [ ]:
abs(mean(X_8bit_nearest) - mean(X_original))
