---
title: "transfo examples"
author: "Raymaekers, J. and Rousseeuw, P.J."
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{transfo examples}
  %\VignetteEngine{knitr::knitr}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  fig.width  = 5,
  fig.height = 5,
  fig.align  = 'center'
)
```

# Introduction

This file contains some examples of the `transfo` function to robustly transform
variables toward central normality


```{r}
library(cellWise)
library(robustHD) # for the TopGear data
```
***
# Small toy example

We start with a small toy example, in which standard normal data with one NA
is used. 
```{r, fig.width = 5, fig.height = 5}
set.seed(1)
X = rnorm(100) 
X[50] = NA
qqnorm(X)
```


We specify the transformation type as `"bestObj"`, which chooses between Box-Cox and
Yeo-Johnson based on the value of the objective function. When there are negative
values, Yeo-Johnson is chosen automatically. First, the classical MLE transformation
is fit, indicated by the `robust = FALSE` argument:
```{r}
ML.out <- transfo(X, type = "bestObj", robust=FALSE)
ML.out$lambdahat
```
The value of the transformation parameter is close to 1, indicating no transformation
is necesarry. Among the other outputs are the value of the objective function, 
the transformed data, estimates of the location and scale of the transformed data:
```{r, fig.width = 5, fig.height = 5}
ML.out$objective
ML.out$Y[45:55] 
ML.out$muhat[1:20] 
ML.out$sigmahat[1:20] 
qqnorm(ML.out$Y); abline(0,1)
```
Additionally, the function returns the weights of the observations and the
transformation type that was used. The NA observation (number 50) gets weight zero:
```{r, fig.width = 5, fig.height = 3.5}
plot(ML.out$weights)
ML.out$ttypes[1:20] 
```
We can now repeat the same procedure with the robust estimator. 
This is specified using the `robust = TRUE` argument. The results do not differ
too much from the classical MLE estimates on this dataset, since there are no 
outliers.

```{r, fig.width = 5, fig.height = 5}
RewML.out <- transfo(X, type = "bestObj", robust=TRUE)
RewML.out$lambdahat 
RewML.out$objective 
RewML.out$Y[45:55]
RewML.out$muhat 
RewML.out$sigmahat 
qqnorm(RewML.out$Y); abline(0,1)
```
Using the robust estimator, one large observation (number 61) also gets a weight of 0:

```{r, fig.width = 5, fig.height = 3.5}
plot(RewML.out$weights) 
X[61] 
RewML.out$ttypes 
```

In order to illustrate the same concepts for the Box-Cox transformation,
we take the exponential of the generated data. This yields a sample from a 
log-normal distribution.

```{r}
X = exp(X) 

```
We first consider the classical maximum likelihood estimator. We obtain a transformation parameter close to 0, which corresponds with the logarithmic
transformation.
```{r, fig.width = 5, fig.height = 5}
ML.out <- transfo(X, type = "BC", robust=FALSE)
ML.out$lambdahat 
ML.out$objective 
ML.out$Y[45:55]  
qqnorm(ML.out$Y); abline(0,1)
```

The weights are again all equal to 1, except for the NA (number 50) which gets weight zero:
```{r, fig.width = 5, fig.height = 3.5}
plot(ML.out$weights) 
ML.out$ttypes 

```
We now repeat the example with the robust transformation. The estimated transformation parameter is again fairly close to 0, as expected.
```{r, fig.width = 5, fig.height = 5}
RewML.out <- transfo(X, type = "bestObj", robust=TRUE)
RewML.out$lambdahat 
RewML.out$objective 
RewML.out$Y[45:55] 
qqnorm(RewML.out$Y); abline(0,1)
RewML.out$ttypes 
```
***
# Transform new data and transform back

We briefly illustrate the functions transfo_newdata() and tranfo_transformback. We start with transfo_newdata(). Given the output of transfo(), this function applies the fitted transformation to new data.


```{r, fig.width = 5, fig.height = 5}
set.seed(123); tempraw <- matrix(rnorm(2000), ncol = 2)
tempx <- cbind(tempraw[, 1],exp(tempraw[, 2]))
tempy <- 0.5 * tempraw[, 1] + 0.5 * tempraw[, 2] + 1
x <- tempx[1:900, ]
y <- tempy[1:900]
tx.out <- transfo(x, type = "bestObj")
tx.out$ttypes
tx.out$lambdahats
tx <- tx.out$Y
lm.out <- lm(y ~ tx)
summary(lm.out)
xnew <- tempx[901:1000, ]
xtnew <- transfo_newdata(xnew, tx.out)
yhatnew <- cbind(1, xtnew) %*% lm.out$coefficients 
plot(tempy[901:1000], yhatnew); abline(0, 1)
```

We now illustrat the transfo_transformback() function. Given a transformation fitted by transfo(), this function applies the inverse of this transformation to the input. 
```{r, fig.width = 5, fig.height = 5}
set.seed(123); x <- matrix(rnorm(2000), ncol = 2)
y <- sqrt(abs(0.3 * x[, 1] + 0.5 * x[, 2] + 4))
ty.out <- transfo(y, type = "BC")
ty.out$lambdahats
ty <- ty.out$Y
lm.out <- lm(ty ~ x)
yhat <- transfo_transformback(lm.out$fitted.values, ty.out)
plot(y, yhat); abline(0, 1)
```

# TopGear example

This dataset is part of the `robustHD` package. We use two of its variables
to illustrate the effect outliers can have on estimating a transformation
using maximum likelihood. The robust estimators yield more reasonable results.

```{r}
data(TopGear) 
``` 

First, we do some preprocessing to avoid zeroes. 

```{r}
CD.out   <- checkDataSet(TopGear)
colnames(CD.out$remX)
# remove the subjective variable `Verdict':
X        <- CD.out$remX[,-12] 
carnames <- TopGear[CD.out$rowInAnalysis, 1:2]
X        <- pmax(X, 1e-8) # avoids zeroes
``` 


Now we estimate transformation parameters (takes under a second) using
classical MLE as well as the reweighted MLE which is much more robust. 


```{r}
ML.out    <- transfo(X, type = "BC", robust=F)
RewML.out <- transfo(X, type = "BC", robust=T)
``` 


Now consider the transformation of the variable `MPG`.
```{r}
ML.out$lambdahat[7] 
RewML.out$lambdahat[7] 
```
The classical method estimates a transformation parameter of 
-0.11, which is close to the log transform.
The robust method barely transforms the original variable
with a transformation parameter of 0.84, relatively close to 1.
The QQplots below suggest that the robust transformation is much more reasonable:

```{r, fig.width = 5, fig.height = 5}
qqnorm(X[, 7], main = "", cex.lab = 1.5, cex.axis = 1.5)

qqnorm(ML.out$Y[, 7], main = "Classical transform",
       cex.lab = 1.5, cex.axis = 1.5)
abline(0, 1)

qqnorm(RewML.out$Y[, 7], main = "Robustly transformed",
       cex.lab = 1.5, cex.axis = 1.5)
abline(0, 1)
```

Now consider the transformation of the variable `Weight`.

```{r, fig.width = 5, fig.height = 5}
ML.out$lambdahat[8] 
RewML.out$lambdahat[8] 
``` 
Here, the situation is reversed: the classical estimator finds 
a transformation parameter (0.83) relatively close to one, 
whereas the robust transformation finds a parameter (0.09)
close to the log transform. The QQplots below again suggest that the robust transformation is much more reasonable:

```{r, fig.width = 5, fig.height = 5}
qqnorm(X[, 8], main = "Original variable", 
       cex.lab = 1.5, cex.axis = 1.5)

qqnorm(ML.out$Y[, 8], main = "Classical transform", 
       cex.lab = 1.5, cex.axis = 1.5)
abline(0, 1)

qqnorm(RewML.out$Y[, 8], main = "Robustly transformed", 
       cex.lab = 1.5, cex.axis = 1.5)
abline(0, 1)

```
The classical transformation does not fit as well as it attempts 
to push the 5 outliers into the fold at the expense of creating
skewness in the center of the data.

***

# Glass data example

In this example we study the glass data. It consists of spectra with 750 wavelengths of 180 archaeological glass samples.

```{r}
data("data_glass")
```

First, we do some preprocessing to avoid variables with very small scales (in fact, the first 13 variables have mad equal to zero). Afterwards, we focus on the first 500 wavelengths since this is where most of the activity occurs.

```{r}
X <- as.matrix(data_glass[, -c(1:13)])
X <- X[, 1:500]
Z <- scale(X, center=FALSE, robustbase::colMedians(X))
dim(Z)
```


Now we estimate transformation parameters using
classical MLE as well as the reweighted MLE which is much more robust.
This only takes a few seconds for the 500 variables.

```{r}
ML.out    <- transfo(Z, type = "YJ", robust=F)
RewML.out <- transfo(Z, type = "YJ", robust=T)
```

We now construct the cellmaps:

```{r, fig.width  = 8, fig.height = 4}
indcells_clas = which(abs(ML.out$Y) > sqrt(qchisq(0.99, 1)))
indcells_rob  = which(abs(RewML.out$Y) > sqrt(qchisq(0.99, 1)))
n = dim(ML.out$Y)[1]
d = dim(ML.out$Y)[2]; d
nrowsinblock = 5
rowlabels = rep("", floor(n/nrowsinblock));
ncolumnsinblock = 5
columnlabels = rep("",floor(d/ncolumnsinblock));
columnlabels[3] = "1";
columnlabels[floor(d/ncolumnsinblock)] = "500"

CM_clas = cellMap(ML.out$Y, 
                  nrowsinblock = nrowsinblock, 
                  ncolumnsinblock = ncolumnsinblock,
                  rowblocklabels = rowlabels,
                  columnblocklabels = columnlabels,
                  mTitle = "YJ transformed variables by ML",
                  rowtitle = "", columntitle = "wavelengths",
                  columnangle = 0) 
plot(CM_clas)

CM_rob = cellMap(RewML.out$Y, 
                 nrowsinblock = nrowsinblock, 
                 ncolumnsinblock = ncolumnsinblock, 
                 rowblocklabels = rowlabels,
                 columnblocklabels = columnlabels,
                 mTitle = "YJ transformed variables by RewML",
                 rowtitle = "", columntitle = "wavelengths",
                 columnangle = 0) 
plot(CM_rob)

# pdf("Glass_YJ_ML_RewML.pdf",width=8,height=8)
# gridExtra::grid.arrange(CM_clas, CM_rob,ncol=1)
# dev.off()
```

***

# DPOSS data example

As a last example, we analyze the DPOSS data. This is a random subset of 20'000 stars from the Digitized Palomar Sky Survey described by Odewahn et al (1998).

```{r}
data("data_dposs") # in package cellWise
n = nrow(data_dposs); n 
ncol(data_dposs) 
```

There are lots of missing values in this dataset. Therefore, we first do
some preprocessing to select the band of wavelengths which contain the 
fewest missing values.

```{r}
missmat = is.na(data_dposs)
sizemat = nrow(missmat)*ncol(missmat); sizemat 
100*sum(as.vector(missmat))/sizemat 

missrow = length(which(rowSums(missmat) > 0))
100*missrow/nrow(missmat) 

# Missingness by band:

# F band:
300*sum(as.vector(missmat[,1:7]))/sizemat           
100*length(which(rowSums(missmat[,1:7]) > 0))/20000 

# J band:
300*sum(as.vector(missmat[,8:14]))/sizemat           
100*length(which(rowSums(missmat[,8:14]) > 0))/20000  

# N band:
300*sum(as.vector(missmat[,15:21]))/sizemat           
100*length(which(rowSums(missmat[,15:21]) > 0))/20000  

# So typically the whole band is missing or not.
# We focus on the J band which has the most available rows.

indx = which(rowSums(missmat[,8:14]) ==0)
dpossJ = data_dposs[indx,8:14]
dim(dpossJ) 
```

A quick exploration of the skewness in the data shows that there is 
skewness in both directions:
```{r, fig.width = 5, fig.height = 5}
par(mfrow = c(1, 1))
boxplot(scale(dpossJ)) 
``` 

Now we fit the Yeo-Johnson transformation robustly and transform the 
data with the estimated parameters. The analysis shows that there are both
lambdas larger than 1 and smaller than 1, corresponding with the skewness in
both directions.

```{r, fig.width = 5, fig.height = 5}
transfoJ_YJ <- transfo(dpossJ, type = "YJ", robust = T)
plot(transfoJ_YJ$lambdahat) 
dpossJ_YJ = transfoJ_YJ$Y
```

We now apply cellwise robust PCA to further analyze the data. We perform the PCA using
both the original and the transformed data.
```{r}
DDCPars = list(fastDDC=F,fracNA=0.5)
MacroPCAPars = list(DDCpars=DDCPars,scale=TRUE,silent=T)
MacroPCAdpossJ = MacroPCA(dpossJ,k=4,MacroPCApars=MacroPCAPars) 
MacroPCAdpossJ_YJ = MacroPCA(dpossJ_YJ,k=4,MacroPCApars=MacroPCAPars) 
```

Let's make a plot of the scores of YJ-transformed data:
```{r, fig.width = 6, fig.height = 6}
MacroPCAdpossJ_YJ$scores[, 1] <- -MacroPCAdpossJ_YJ$scores[, 1] 
cols <- rep("black", dim(MacroPCAdpossJ$scores)[1])
cols[c(98, 10894)] <- "darkorange"
cols[which(MacroPCAdpossJ_YJ$SD > 14)] <- "skyblue3"
cols[which(MacroPCAdpossJ_YJ$SD > 25)] <- "firebrick"

# pdf("dpossJ_scores_YJ.pdf")
pairs(MacroPCAdpossJ_YJ$scores,gap=0,main="",pch=19,col=cols)
# dev.off()
```

We finally show the outlier maps on the transformed and the untransformed data:

```{r, fig.width = 6, fig.height = 5}
# pdf("dpossJ_outliermap_YJ.pdf",height=5,width=5)
outlierMap(MacroPCAdpossJ_YJ, title="Outlier map of transformed data", col=cols, labelOut=FALSE)
# dev.off()

# pdf("dpossJ_outliermap_rawdata.pdf",height=5,width=5)
outlierMap(MacroPCAdpossJ, title="Outlier map of raw data", col=cols, labelOut=FALSE)
# dev.off()
```