I’m performing LCA using the poLCA package in R and trying to calculate entropy, which returns NaN for some of my models.
The following example code is used for the entropy calculation:
> entropy<-function (p) sum(-p*log(p))
> error_prior <- entropy(lca2$P) # Class proportions model 2
> error_post <- mean(apply(lca2$posterior, 1, entropy), na.rm = TRUE)
> results[2,8] <- round(((error_prior - error_post) / error_prior), 3)
From the answer to the question "Entropy output is NaN for some class solutions and not others" I understand this to be caused by zeros passed to the entropy function. The issue is resolved by adding na.omit inside entropy as follows:
entropy <- function (p) sum(na.omit(-p*log(p)))
My question is – is adding this na.omit to entropy calculations a technically accepted method for resolving this issue without affecting the integrity of the calculation?
When I run the entropy calculations with and without na.omit, around a third of the values (presumably those with a zero somewhere in the calculation) change. I’m now unsure whether I should always use na.omit in the entropy function or whether there is another way of resolving this problem.
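For reference, the NaN is easy to reproduce in base R with a made-up posterior row containing an exact zero (not real model output): log(0) is -Inf, and 0 * -Inf evaluates to NaN, which poisons the whole sum.

```r
## Minimal reproduction: a toy posterior row with a structural zero
entropy <- function(p) sum(-p * log(p))
p <- c(0.7, 0.3, 0)

entropy(p)                # NaN: the last term is 0 * log(0) = 0 * -Inf
sum(na.omit(-p * log(p))) # 0.6108643: the NaN term is dropped
```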
>Solution :
It is valid, but not transparent at first glance. The reason is that the mathematical limit of x * log(x) as x -> 0 is 0 (which can be proved with L’Hôpital’s rule). In this regard, the most robust definition of the function is:
entropy.safe <- function (p) {
  if (any(p > 1 | p < 0)) stop("probability must be between 0 and 1")
  ## log(0) is -Inf; leave those entries at 0 so the p = 0 term contributes 0
  log.p <- numeric(length(p))
  safe <- p != 0
  log.p[safe] <- log(p[safe])
  sum(-p * log.p)
}
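The limit claim can also be checked numerically (a quick sanity check, not part of the original answer): x * log(x) shrinks toward 0 as x approaches 0 from above.

```r
## x * log(x) -> 0 as x -> 0+, so defining the p = 0 term as 0 is natural
x <- 10^-(1:6)
round(x * log(x), 7)
#> [1] -0.2302585 -0.0460517 -0.0069078 -0.0009210 -0.0001151 -0.0000138
```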
But simply dropping p = 0 cases gives identical results, because the entropy at p = 0 is 0 and contributes nothing anyway.
entropy <- function (p) {
  if (any(p > 1 | p < 0)) stop("probability must be between 0 and 1")
  ## -p * log(p) is NaN exactly when p = 0; na.rm = TRUE drops those terms
  sum(-p * log(p), na.rm = TRUE)
}
p <- seq(0, 1, 0.1)
entropy(p)
#[1] 2.455935
entropy.safe(p)
#[1] 2.455935
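Tying this back to the poLCA workflow: with either definition, the relative-entropy statistic from the question can be computed even when posterior rows contain exact zeros. A sketch with made-up stand-ins for lca2$P and lca2$posterior (a 3-row, 2-class posterior, not real model output):

```r
entropy <- function (p) {
  if (any(p > 1 | p < 0)) stop("probability must be between 0 and 1")
  sum(-p * log(p), na.rm = TRUE)
}

## Hypothetical stand-ins for lca2$P and lca2$posterior
P <- c(0.6, 0.4)
posterior <- rbind(c(0.9, 0.1),
                   c(1.0, 0.0),   # exact zero: no NaN with this entropy()
                   c(0.2, 0.8))

error_prior <- entropy(P)                          # entropy of class proportions
error_post  <- mean(apply(posterior, 1, entropy))  # mean row-wise posterior entropy
round((error_prior - error_post) / error_prior, 3) # relative entropy (0 to 1 scale)
#> [1] 0.591
```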