I have always been puzzled and amazed by the idea of “embedding”. A high-dimensional space, such as a corpus for a language, can be represented using, say, only 50 dimensions. How amazing! This is a huge saving in covariate-space dimension compared to one-hot encoding. In a previous course at UConn called Data Science in Action, I did some text classification based on one-hot encoding and tf-idf weighting of text messages after tokenization, but that was a rather naive application: there were 9376 words in a total of 5572 messages, and I did not try to reduce the dimension of the covariate space but applied a bunch of classification algorithms directly. The project is on GitHub.
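To make the dimensionality gap concrete, here is a tiny illustration in plain Python. The vocabulary size matches the project above; the 50-dimensional embedding size is just the hypothetical figure mentioned earlier, not something I actually fit:

```python
# Hypothetical illustration of the saving an embedding buys over one-hot encoding.
vocab_size = 9376      # one word per dimension, as in the project above
embedding_dim = 50     # a typical embedding size (assumed, for illustration)

def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1.0 at `index`."""
    v = [0.0] * size
    v[index] = 1.0
    return v

word_vec = one_hot(3, vocab_size)
print(len(word_vec))               # 9376 numbers to represent a single word
print(vocab_size / embedding_dim)  # 187.52, i.e. a ~188x reduction
```

Every word costs 9376 mostly-zero coordinates under one-hot encoding, while a learned embedding would compress that to 50 dense ones.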
Following my last post on using TensorFlow for linear regression, in this post I am going to extend the scope to generalized linear models.
>>> import tensorflow as tf
>>> import tensorflow_probability as tfp
>>> tfd = tfp.distributions
>>> from tensorflow.keras import layers
Let us simulate some data first.
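A minimal sketch of what such a simulation might look like, here done with NumPy for a Poisson GLM with a log link (the coefficients, intercept, and sample size are made-up values for illustration; the same draw could equally be done with `tfd` distributions):

```python
import numpy as np

rng = np.random.default_rng(42)     # fixed seed for reproducibility (assumed)
n = 1000
x = rng.normal(size=(n, 2))         # two standard-normal covariates
beta = np.array([0.5, -0.3])        # hypothetical true coefficients
eta = 1.0 + x @ beta                # linear predictor with intercept 1.0
mu = np.exp(eta)                    # inverse of the log link
y = rng.poisson(mu)                 # Poisson responses with mean mu
```

With the data in hand, the GLM-fitting part of the post would then try to recover `beta` from `(x, y)`.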
This image is particularly useful when you want to utilize some of the command
line tools that come conveniently with Linux systems. Also, I manage my Python
packages using conda, and there are times when I need a package only temporarily
yet it is not available via conda. For example, I recently defended my thesis,
and would like to make an academic pedigree for myself using the excellent
geneagrapher. The original code
was written in Python 2 but my local machine has Python 3. Therefore I
considered using the base Ubuntu Docker image:
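A minimal sketch of such a setup (the Ubuntu tag and the pip-based install of geneagrapher are my assumptions, not necessarily the exact commands from the post):

```dockerfile
# Hypothetical Dockerfile: an Ubuntu base image with Python 2,
# so the original Python 2 geneagrapher code can run unmodified.
FROM ubuntu:18.04

RUN apt-get update && \
    apt-get install -y python python-pip && \
    pip install geneagrapher
```

Building and entering this container gives a throwaway Python 2 environment without touching the conda setup on the host.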