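KL Divergence and Cross-Entropy (CE)

Published on 2019-07-11 in machine learning

KL Divergence

First, recall the definition of entropy:

$H(x)=E_{x \sim P}[I(x)]$

where $I(x)$ is the self-information of the event $x$:

$I(x)=-\log P(x)$

Self-information describes only a single outcome, whereas entropy quantifies the total uncertainty of an entire distribution.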

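If there are two separate probability distributions $P(x)$ and $Q(x)$ over the same random variable x, the KL divergence measures how different they are:

$D_{KL}(P||Q)= E_{x \sim P}[\log \frac{P(x)}{Q(x)}]=E_{x \sim P}[\log P(x) - \log Q(x)]$

Note that $D_{KL}(P||Q)$ is generally not equal to $D_{KL}(Q||P)$. KL divergence is not symmetric, so it is not a true distance.

In the discrete case, KL divergence is:

$D_{KL}(P||Q)= \sum_x P(x) [\log\frac{P(x)}{Q(x)}] \\ = \sum_x P(x) [\log P(x)-\log Q(x)] \\ = \sum_x P(x) \log P(x) - \sum_x P(x) \log Q(x)$

This shows that KL divergence is the extra information needed, on average, to encode samples from $P$ with a code built for $Q$ rather than one built for $P$.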
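To make the asymmetry concrete, here is a minimal sketch of the discrete formula in plain Python. The function name `kl_divergence` and the two example distributions are made up for illustration, and the natural log is assumed, so the values are in nats.

```python
import math

def kl_divergence(p, q):
    """Discrete D_KL(P || Q) in nats; terms with P(x) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two hypothetical distributions over the same three outcomes.
P = [0.7, 0.2, 0.1]
Q = [0.4, 0.4, 0.2]

kl_pq = kl_divergence(P, Q)
kl_qp = kl_divergence(Q, P)
print(kl_pq, kl_qp)  # the two directions give different values
```

Both directions are positive but not equal, and the divergence of a distribution from itself is zero.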

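Cross-Entropy

Cross-entropy is defined as:

$CE(P,Q)=-E_{x \sim P}[\log Q(x)]$

In the discrete case:

$CE(P,Q) = -\sum_x P(x) \log Q(x)$

It is closely related to KL divergence. The last line of the discrete KL expansion is $-H(P) + CE(P,Q)$, so the two differ only by the entropy term:

$CE(P,Q) = H(P) + D_{KL}(P||Q)$

In machine learning, CE is commonly used as the loss function for classification. Because the labels of the samples are fixed, $H(P)$ is a constant, so minimizing CE is equivalent to minimizing the KL divergence. The underlying idea is to treat the true labels as one distribution and the predicted labels as another.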
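The identity can be checked numerically. The sketch below uses plain Python with made-up distributions; the helper names are illustrative only. With a one-hot label, $H(P)$ is 0, so CE and KL coincide.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

label = [0.0, 1.0, 0.0]   # one-hot label distribution P (hypothetical)
pred = [0.1, 0.7, 0.2]    # predicted distribution Q (hypothetical)

ce = cross_entropy(label, pred)
h = entropy(label)
kl = kl_divergence(label, pred)
print(ce, h + kl)  # equal up to floating-point error

# The identity also holds for a soft (non one-hot) P.
soft = [0.7, 0.2, 0.1]
gap = cross_entropy(soft, pred) - (entropy(soft) + kl_divergence(soft, pred))
```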

Cyclical Learning Rates for Training Neural Networks

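Published on 2019-07-02 in deep learning

paper details

A common rule of thumb in deep learning is that the learning rate should decrease gradually during training. This paper reports a surprising result: letting the learning rate rise and fall during training is beneficial. The author therefore recommends varying the learning rate cyclically within a range instead of fixing it to a single value.

Cyclical Learning Rates (CLR) come from the observation that increasing the learning rate may hurt in the short term but helps in the long term. This leads to the idea of letting the learning rate vary within a range rather than decreasing it stepwise or exponentially: set a minimum and a maximum boundary and let the learning rate cycle between them, as in the triangular learning rate policy:

[Figure: triangular learning rate policy]

An intuitive explanation of why CLR works: the difficulty in minimizing the loss comes from saddle points rather than from poor local minima. Gradients near a saddle point are small, so learning is slow, and increasing the learning rate helps traverse the saddle-point region faster. A more practical reason is that the optimal learning rate probably lies between the min and max boundaries, so values near the optimum are used during part of each cycle (while the larger values help escape saddle points).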

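Besides the triangular learning rate policy shown above, there are two more:

triangular2: the same as triangular, except that the learning rate range (max_lr - base_lr) is halved after each cycle.
exp_range: the boundary values decay by an exponential factor as training proceeds.

[Figures: the triangular2 and exp_range policies]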
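The three policies can be sketched in a few lines of plain Python. This is an illustration rather than the paper's reference code: the function name `cyclical_lr`, its argument names, and the default `gamma` are assumptions, while base_lr = 0.001 and max_lr = 0.006 are the values quoted from the paper.

```python
import math

def cyclical_lr(iteration, base_lr, max_lr, step_size,
                mode="triangular", gamma=0.9999):
    """Learning rate at a given iteration; one cycle lasts 2 * step_size."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    scale = max(0.0, 1.0 - x)            # 0 at the cycle ends, 1 at the peak
    if mode == "triangular2":
        scale /= 2 ** (cycle - 1)        # halve the range after every cycle
    elif mode == "exp_range":
        scale *= gamma ** iteration      # shrink the range exponentially
    return base_lr + (max_lr - base_lr) * scale

for it in (0, 1000, 2000, 3000, 4000):
    print(it, cyclical_lr(it, 0.001, 0.006, step_size=2000))
```

The rate climbs from base_lr to max_lr over one stepsize and falls back over the next, which is the triangle in the first figure.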

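The paper also describes how to estimate the cycle length:

A good stepsize is 2 to 10 times the number of iterations per epoch. On CIFAR-10, a stepsize of 8 epochs gives only slightly better results than a stepsize of 2 epochs.

It also explains how to estimate reasonable min and max boundaries.

The first method is the "LR range test": run the model for a few epochs while the learning rate increases from a very small value to a very large one, then plot accuracy versus learning rate:

[Figure: accuracy versus learning rate]

Note the learning rate at which the accuracy starts to increase and the one at which it slows down (or starts to fall). These two values are good choices for the bounds: base_lr is the first and max_lr is the second. Alternatively, a rule of thumb is to set base_lr to 1/3 or 1/4 of max_lr. In the paper the author chose base_lr = 0.001 and max_lr = 0.006.

Another way to choose the bounds is to plot loss versus learning rate:

[Figure: loss versus learning rate]

Which learning rate is best in this plot? Not the one at the lowest loss, because by that point the learning rate is already a bit too large. We want a more aggressive point that still trains quickly, namely the point where the loss is falling fastest.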
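Picking "the point where the loss falls fastest" can be automated once the range test has produced (lr, loss) pairs. The sketch below is one possible way to do it, assuming the learning rates were swept on a log scale; the helper name `steepest_lr` and the numbers are hypothetical, not measurements.

```python
import math

def steepest_lr(lrs, losses):
    """Return the lr at the end of the segment with the most negative
    slope of loss with respect to log10(lr)."""
    best_lr, best_slope = None, 0.0
    for i in range(1, len(lrs)):
        d_log_lr = math.log10(lrs[i]) - math.log10(lrs[i - 1])
        slope = (losses[i] - losses[i - 1]) / d_log_lr
        if slope < best_slope:
            best_lr, best_slope = lrs[i], slope
    return best_lr

# Hypothetical range-test results: loss falls, bottoms out, then diverges.
lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
losses = [2.30, 2.25, 1.90, 1.20, 1.00, 3.50]

print(steepest_lr(lrs, losses))          # lr where the loss drops fastest
print(lrs[losses.index(min(losses))])    # lr at the lowest loss (larger)
```

In this made-up curve the steepest descent sits one decade below the learning rate with the lowest loss, which matches the advice above.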

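Experiment

While working on a Kaggle competition, I accidentally swapped CLR's base_lr and max_lr. During the first stepsize, training was very fast and quickly reached a good accuracy, but after that stepsize the accuracy kept dropping. At first I thought CLR itself was the problem, then realized I had passed the arguments in the wrong order. Since it converged so quickly, I think the behaviour is still worth borrowing. The resulting lr schedule looked like this:

[Figure: learning rate schedule with base_lr and max_lr swapped]

It is CLR shifted by one stepsize. This is very similar to SGDR, and I plan to try it.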

Reference

https://arxiv.org/pdf/1506.01186.pdf

https://sgugger.github.io/how-do-you-find-a-good-learning-rate.html#how-do-you-find-a-good-learning-rate

https://towardsdatascience.com/estimating-optimal-learning-rate-for-a-deep-neural-network-ce32f2556ce0

https://www.paperweekly.site/papers/notes/598

TorchVision Image Transforms

Published on 2019-07-02 in deep learning

The transforms are mostly image transformations, and they can be chained together with Compose.

transforms.Compose([
  transforms.CenterCrop(10),
  transforms.ToTensor()
])

Transforms on PIL Image

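`torchvision.transforms.CenterCrop`(size)

Crops the given PIL image at the center.

Parameter: size (int or sequence). If it is a sequence such as (h, w), the crop has size h*w.

If it is an int, the crop has size (size, size).

`torchvision.transforms.FiveCrop`(size)

Crops the given PIL image at the four corners and the center.

The size parameter is the same as above.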

>>> transform = Compose([
>>>    FiveCrop(size), # this is a list of PIL Images
>>>    Lambda(lambda crops: torch.stack([ToTensor()(crop) for crop in crops])) # returns a 4D tensor
>>> ])

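`torchvision.transforms.Pad`(padding, fill=0, padding_mode='constant')

Pads the image on all four sides with the given pad value.

Parameters: padding: the amount of padding on each border.

If a single int is given, it is used for all borders.

If a tuple of length 2 is given, the values are the left/right and top/bottom padding respectively.

If a tuple of length 4 is given, the values are the left, top, right, and bottom padding respectively.

fill: the fill value used when padding_mode is constant. The default is 0. A tuple of length 3 is interpreted as the R, G, and B fill values.

padding_mode: the type of padding:

constant: pads with a constant value

edge: pads with the last value at the edge of the image

reflect: pads with a reflection of the image without repeating the last value on the edge

symmetric: pads with a reflection of the image, repeating the last value on the edge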
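The difference between reflect and symmetric is easiest to see on a single row of pixels. The helper below is a hypothetical pure-Python illustration of the four modes on a 1-D list; it is not torchvision code.

```python
def pad_1d(values, pad, mode, fill=0):
    """Pad a list on both sides by `pad` items (1 <= pad < len(values))."""
    if mode == "reflect":        # mirror, edge value not repeated
        left = values[1:pad + 1][::-1]
        right = values[-pad - 1:-1][::-1]
    elif mode == "symmetric":    # mirror, edge value repeated
        left = values[:pad][::-1]
        right = values[-pad:][::-1]
    elif mode == "edge":         # repeat the edge value
        left = [values[0]] * pad
        right = [values[-1]] * pad
    else:                        # constant
        left = right = [fill] * pad
    return left + values + right

row = [1, 2, 3, 4]
print(pad_1d(row, 2, "reflect"))    # [3, 2, 1, 2, 3, 4, 3, 2]
print(pad_1d(row, 2, "symmetric"))  # [2, 1, 1, 2, 3, 4, 4, 3]
print(pad_1d(row, 2, "edge"))       # [1, 1, 1, 2, 3, 4, 4, 4]
print(pad_1d(row, 2, "constant"))   # [0, 0, 1, 2, 3, 4, 0, 0]
```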

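`torchvision.transforms.Grayscale`(num_output_channels=1)

Converts the image to grayscale.

Parameter: num_output_channels, the number of channels wanted in the output image. The default is 1; it can also be 3.

Returns: the grayscale version of the input. If num_output_channels is 1, the returned image has a single channel; if it is 3, the returned image has three channels with r = g = b.

Return type: PIL image

`torchvision.transforms.Resize`(size, interpolation=2)

Resizes the input PIL image to the given size.

Parameters: size (sequence or int): the desired output size. If size is an int, the shorter edge of the image is matched to this number; i.e., if height > width, the image is rescaled to (size*height/width, size). If size is a sequence, the output size is matched to the given (h, w).

interpolation: the interpolation method. The default is PIL.Image.BILINEAR.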
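The rule for an int size can be written out as a small function. This is a sketch of the rule as described above, not the library's implementation; the name `resize_output_size` and the truncation to int are assumptions.

```python
def resize_output_size(height, width, size):
    """Return (new_height, new_width) for a Resize-style `size` argument."""
    if isinstance(size, int):
        if height > width:
            # width is the shorter edge and becomes `size`
            return int(size * height / width), size
        # height is the shorter (or equal) edge and becomes `size`
        return size, int(size * width / height)
    return tuple(size)  # an explicit (h, w) is used as given

print(resize_output_size(400, 200, 100))       # (200, 100)
print(resize_output_size(200, 400, 100))       # (100, 200)
print(resize_output_size(200, 400, (50, 60)))  # (50, 60)
```

With an int size the aspect ratio is preserved; with an explicit (h, w) it is not.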

Transforms on torch.*Tensor

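`torchvision.transforms.Normalize`(mean, std, inplace=False)

Normalizes a tensor image with the given mean and std. The operation is applied to each channel:

$\frac{input[channel] - mean[channel]}{std[channel]}$

Parameters: mean: the mean of each channel

std: the standard deviation of each channel

Returns: the normalized Tensor image

Return type: Tensor

Note: with the default inplace=False, the input Tensor is not modified in place.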
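The per-channel arithmetic can be illustrated without any tensor library. The sketch below uses nested lists in (C, H, W) order; the function name and the pixel values are made up for illustration.

```python
def normalize(image, mean, std):
    """Apply (value - mean[c]) / std[c] per channel; returns a new list."""
    return [
        [[(px - m) / s for px in row] for row in channel]
        for channel, m, s in zip(image, mean, std)
    ]

# A hypothetical 2-channel, 2x2 image with values already in [0, 1].
image = [
    [[0.0, 0.5], [1.0, 0.5]],   # channel 0
    [[0.2, 0.4], [0.6, 0.8]],   # channel 1
]
out = normalize(image, mean=[0.5, 0.5], std=[0.5, 0.25])
print(out[0])  # [[-1.0, 0.0], [1.0, 0.0]]
```

The input list is left untouched, mirroring the out-of-place default noted above.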

Conversion Transforms

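`torchvision.transforms.ToPILImage`(mode=None)

Converts a Tensor or ndarray to a PIL image.

Parameter: mode:

If mode is not given:

If the input has 4 channels, the mode defaults to RGBA

If the input has 3 channels, the mode defaults to RGB

If the input has 2 channels, the mode defaults to LA

If the input has 1 channel, the mode is determined by the data type

`torchvision.transforms.ToTensor`

Converts a PIL image or ndarray to a Tensor.

Converts a PIL image or ndarray (H x W x C) with values in the range [0, 255] to a FloatTensor of shape (C, H, W) with values in the range [0.0, 1.0], provided that the PIL image belongs to one of the modes (L, LA, P, I, F, RGB, YCbCr, RGBA, CMYK, 1) or the numpy.ndarray has dtype = np.uint8.

In all other cases, tensors are returned without scaling.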

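FiveCrop and TenCrop

After these two operations, one image becomes five or ten images. How do we keep them matched to the label during training or testing?
The idea is that all the crops share the same label. For a classification task, cross-entropy can still be used, with the predictions averaged over the 10 crops.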

transform = Compose([
    TenCrop(size), # this is a list of PIL Images
    Lambda(lambda crops: torch.stack([ToTensor()(crop) for crop in crops])) # returns a 4D tensor
])

#In your test loop you can do the following:
input, target = batch # input is a 5d tensor, target is 2d
bs, ncrops, c, h, w = input.size()
result = model(input.view(-1, c, h, w)) # fuse batch size and ncrops

result_avg = result.view(bs, ncrops, -1).mean(1) # avg over crops

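Handling Long-Tailed Features

Published on 2019-07-02 in machine learning

Apply a log transform to the feature.

[Figure: feature distribution before the transform]

After the log transform:

[Figure: feature distribution after the log transform]

In Kaggle competitions, this log correction can be applied not only to features but also to the target value.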
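A small sketch of the transform in plain Python. The post only says "log"; using `log1p` (which also handles zeros) and its inverse `expm1` is my assumption, and the values are made up. When the target is transformed, predictions have to be mapped back with the inverse.

```python
import math

# Hypothetical long-tailed values, e.g. prices or counts.
values = [1, 2, 3, 5, 8, 20, 150, 4000]

logged = [math.log1p(v) for v in values]      # compress the long tail
restored = [math.expm1(v) for v in logged]    # inverse, for predictions

print(max(values) / min(values))   # spread before the transform
print(max(logged) / min(logged))   # much smaller spread afterwards
```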

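Saving and Loading Models in PyTorch

Published on 2019-07-02 in deep learning

First, the relevant functions:

torch.save: saves a serialized object to disk, using the pickle library internally
torch.load: uses pickle's unpickling facilities to load an object stored on disk into memory
torch.nn.Module.load_state_dict: loads a model's parameters from a state_dict

What is a state_dict:

The learnable parameters of each module in PyTorch, such as weights and biases, are available through module.parameters().

A state_dict is simply a Python dictionary that maps each layer to its parameters. Layers with learnable parameters, registered buffers (such as the running statistics of batch norm), and optimizers all have a state_dict. Because a state_dict is a plain Python dictionary, it is easy to save and modify.

Here is an example of reading a model's state_dict:

for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())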

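Two methods

Back to the main topic: there are two ways to save and load a model.

The first saves and loads the model through its state_dict. When loading, you must first create a model object and then load the parameters into it.

Save:

torch.save(model.state_dict(), PATH)

Load:

model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH))
model.eval()

The second method saves and loads the entire model directly:

Save:

torch.save(model, PATH)

Load:

# Model class must be defined somewhere
# (recent PyTorch releases default to weights_only=True, so loading a fully
# pickled model may need torch.load(PATH, weights_only=False))
model = torch.load(PATH)
model.eval()

The drawback of this approach is that the serialized data is bound to the specific classes and the exact directory structure used when the model was saved. This is because pickle does not save the model class itself; it saves a path to the file containing the class, which is used at load time. As a result, your code can break in various ways when used in other projects or after refactoring.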

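Saving a checkpoint

A checkpoint can be saved for later inference or for resuming training. Unlike saving only the model parameters, the optimizer state is saved as well so that training can continue.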

Save:

torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': loss,
            ...
            }, PATH)

Load:

model = TheModelClass(*args, **kwargs)
optimizer = TheOptimizerClass(*args, **kwargs)

checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']

model.eval()
# - or -
model.train()

Reference

https://pytorch.org/tutorials/beginner/saving_loading_models.html

PyTorch TensorboardX Visualization

Published on 2019-07-02, edited on 2019-07-13, in deep learning

Installing tensorboard

pip install tensorboardX
pip install tensorflow

Usage

Import and create a SummaryWriter

from tensorboardX import SummaryWriter
writer = SummaryWriter('./runs/dogcat1')  # log_dir is ./runs/dogcat1
# close the writer when done
writer.close()

If the log_dir argument is not specified, a runs folder is created automatically. There is also a comment argument, which is appended as a suffix to the automatically generated log directory name.

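Plotting a loss curve:

writer.add_scalar('loss', loss, epoch)

The first argument is the name (tag) of the value being recorded, the second is the Y-axis value, and the third is the X-axis value.

After running the code, start TensorBoard and point it at the log directory:

tensorboard --logdir log_dir  # log_dir is the actual log folder

For example:

[Figure: the command as run]

Result:

[Figure: the resulting loss curve]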

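Plotting activations

This is used to inspect layer activations and weight distributions in deep networks, for example to catch vanishing gradients.

for name, param in net.named_parameters():
    # move to the CPU and convert to NumPy before logging
    writer.add_histogram(
        name, param.detach().cpu().numpy(), epoch_index)

Note that a tensor on the GPU must first be moved to the CPU.

Result:

[Figure: parameter histograms]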

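Plotting the network graph

First instantiate a model, e.g. net, and define an input:

input = torch.rand(dim1, dim2, dim3, dim4)

net = LeNet()
writer.add_graph(net, input)

Again, note that net should be on the CPU.

The result looks like this:

[Figure: network graph]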

Displaying images

writer.add_image('name', image_object)

Projection

Projects high-dimensional vectors into a three-dimensional coordinate system using methods such as PCA or t-SNE. PCA is used by default; t-SNE can also be selected.

writer.add_embedding(mat, metadata=None, label_img=None, global_step=None, tag='default', metadata_header=None)

mat (torch.Tensor or numpy.array) – A matrix in which each row is the feature vector of a data point
metadata (list) – A list of labels; each element will be converted to a string
label_img (torch.Tensor) – Images corresponding to each data point
global_step (int) – Global step value to record
tag (string) – Name for the embedding

References

https://tensorboardx.readthedocs.io/en/latest/tensorboard.html

http://www.pianshen.com/article/3479170564/

https://github.com/pytorch/pytorch/issues/2731

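What Normalization Does

Published on 2019-07-02 in deep learning

The main purpose of normalization is to rescale the different features of the data to a common scale (in other words, to make them dimensionless). For example, if one feature ranges over [0, 1] and another over [0, 1000], an optimization algorithm (especially a gradient-based one) will weight the large-valued feature heavily in its updates and neglect the small-valued one. Normalization avoids this problem by putting all features on the same scale.

Common normalization methods

There are two main methods: min-max normalization and Z-score normalization.

min-max normalization

$x_{minmax} = \frac{x - x_{min}}{x_{max} - x_{min}}$

It has three main drawbacks:

Newly added data can change $x_{max}$ and $x_{min}$, so the scaling has to be recomputed
Outliers strongly affect the result
It is not suitable for long-tailed distributions

It works best for tasks where min and max are fixed, such as normalizing image pixels.

Z-score normalization

$x_{zscore} = \frac{x - mean(x)}{stdev(x)}$

Z-score has fewer problems than min-max and is more robust to outliers. After the transform the data has mean 0 and variance 1, with most values clustered around 0. The shape of the distribution is unchanged: the data does not become normally distributed.

Caveat: it is a common misconception that standardized scores such as z-scores alter the shape of a distribution; in particular, be aware that z-scores cannot magically make a non-normal variable normal.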
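A quick comparison of the two methods on a made-up sample with one outlier. This sketch assumes the usual z-score definition (subtract the mean, divide by the standard deviation) and uses the population standard deviation; the helper names are illustrative.

```python
import statistics

def min_max(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)   # population standard deviation
    return [(x - mu) / sigma for x in xs]

data = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical, with one outlier

mm = min_max(data)
zs = z_score(data)
print(mm)  # the outlier maps to 1, everything else is squeezed near 0
print(zs)  # mean 0 and standard deviation 1, same ordering and shape
```

The single outlier pushes the other min-max values below 0.05, which is the outlier sensitivity listed above.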

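Other methods include logistic, lognormal, and tanh normalization.

Normalizing

Unlike the methods above, this approach rescales each sample directly to unit norm:

$x = \frac{x}{norm(x)}$

Different norms give different results; the L2 norm is the most common.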
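For the L2 case, a minimal sketch in plain Python (the function name and the sample vector are made up for illustration):

```python
import math

def l2_normalize(x):
    """Scale a vector so that its L2 norm is 1."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

unit = l2_normalize([3.0, 4.0])
print(unit)  # [0.6, 0.8]
```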

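PyTorch weight decay (repost)

Published on 2019-07-02 in deep learning

torch.optim implements many optimizers; weight decay only needs to be specified through the optimizer's weight_decay argument:

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-5)

Optimizers also support per-parameter options, which set options for individual groups of parameters to meet finer-grained requirements. In this case the optimizer receives an iterable of dicts. Each dict must contain the key params, which specifies the parameters to optimize, and the other keys must match the optimizer's own keyword arguments.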

optim.SGD([
                {'params': model.base.parameters()},
                {'params': model.classifier.parameters(), 'lr': 1e-3}
            ], lr=1e-2, momentum=0.9)

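This makes it possible to give each submodule its own learning rate, weight decay, and momentum. Weight decay can also be applied to the weights only, leaving the biases untouched:

weight_p, bias_p = [], []
for name, p in model.named_parameters():
    if 'bias' in name:
        bias_p += [p]
    else:
        weight_p += [p]
# Parameter names in the model are generated automatically: weights contain 'weight' in their name and biases contain 'bias'.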

optim.SGD([
          {'params': weight_p, 'weight_decay':1e-5},
          {'params': bias_p, 'weight_decay':0}
          ], lr=1e-2, momentum=0.9)

Reference

https://blog.csdn.net/LoseInVain/article/details/81708474

Softmax and Logsoftmax in Pytorch

Published on 2019-07-02 in deep learning

Output layer and criterion options (all are equivalent; the first is the most popular):

Linear + LogSoftMax + ClassNLLCriterion
Linear + SoftMax + Log + ClassNLLCriterion
Linear + CrossEntropyCriterion

Note that CrossEntropyLoss already includes the softmax operation, so it expects raw, unnormalized scores.

Softmax with the log-likelihood cost can learn faster than softmax with MSELoss.

The log-likelihood loss is

$C = - \sum_k y_k \log(a_k)$

where $a_k$ is the output of a neuron, and $y_k$ is the truth.

The cross-entropy loss is

$C_{CE} = -\sum_k [y_k \log(a_k) + (1-y_k)\log(1-a_k)]$

And what is LogSoftmax?

It applies the $\log(\text{Softmax}(x))$ function to an n-dimensional input Tensor. The LogSoftmax formulation can be simplified as:

$\text{LogSoftmax}(x_{i}) = \log\left(\frac{\exp(x_i) }{ \sum_j \exp(x_j)} \right)$

It is also available in nn.functional as log_softmax, whose documentation notes:

While mathematically equivalent to log(softmax(x)), doing these two operations separately is slower and numerically unstable. This function uses an alternative formulation to compute the output and gradient correctly.

The NLLLoss is:

$Loss = - w_n x_{n,y_n}$

where $w_n$ defaults to 1.

BCELoss is the cross-entropy loss for binary classification. It expects probabilities, so a sigmoid must be applied before BCELoss. BCEWithLogitsLoss combines the sigmoid and BCELoss in a single class.
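To see why the pieces fit together, here is a pure-Python sketch of a numerically stable log-softmax followed by the negative log-likelihood. It illustrates the idea only and is not PyTorch's implementation; the function names and the scores are made up, and the max-subtraction trick is one common "alternative formulation", assumed here.

```python
import math

def log_softmax(logits):
    """Stable log-softmax: subtract the max before exponentiating."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class (weight w_n = 1)."""
    return -log_probs[target]

def cross_entropy(logits, target):
    """Cross-entropy on raw scores: log-softmax followed by NLL."""
    return nll_loss(log_softmax(logits), target)

# Hypothetical scores; math.exp(1000.0) on its own would overflow.
logits = [1000.0, 1001.0, 1002.0]
log_probs = log_softmax(logits)
loss = cross_entropy(logits, target=2)
print(log_probs, loss)
```

Computing softmax first and taking the log afterwards would fail on these scores, while the shifted version gives the same result as for the scores [0, 1, 2].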

References

https://github.com/torch/nn/issues/357

https://pytorch.org/docs/stable/_modules/torch/nn/functional.html#log_softmax

https://pytorch.org/docs/stable/nn.html?highlight=log_softmax#torch.nn.functional.log_softmax

Pytorch SGDR

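Published on 2019-07-02 in deep learning

SGDR paper

The most common learning rate schedule starts from an initial lr and divides it by a constant every few epochs, as shown by the blue circle line and the red square line in the figure below.

[Figure 1: learning rate schedules]

The method proposed in this paper is a warm-restart version of SGD: at every restart the lr is reset to its initial value, while the interval between one restart and the next (the schedule) grows. The authors report that this method reaches good or better results 2 to 4 times faster than other methods.

The i-th warm-started run of SGD lasts for $T_i$ epochs, where i is the index of the run. Importantly, a restart does not start from scratch: it is emulated by increasing the learning rate $\eta_t$, while the old value of $x_t$ is used as the initial solution.

Within the i-th run, the lr is decayed for each batch with cosine annealing:

$\eta_t = \eta_{min}^i + \frac{1}{2}(\eta_{max}^i - \eta_{min}^i)(1 + \cos(\frac{T_{cur}}{T_i}\pi))$

$\eta_{min}^i$ and $\eta_{max}^i$ are the range of the learning rate. $T_{cur}$ is the number of epochs elapsed since the last restart; it is updated at every batch, so it can take fractional values. When $T_{cur} = 0$, $\eta_t = \eta_{max}^i$. When $T_{cur}=T_i$, the cosine equals -1, so $\eta_t = \eta_{min}^i$.

The green, black, and gray lines in Figure 1 show how the lr changes with $T_i$ fixed to 50, 100, and 200 respectively.

SGDR goes one step further: $T_i$ starts small and is multiplied by a factor $T_{mult}$ at every restart, as shown by the dark green and pink lines in Figure 1.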
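The schedule can be sketched as a plain Python function of the (possibly fractional) epoch. It assumes the paper's cosine annealing rule, eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * T_cur / T_i)), with the same eta_min and eta_max in every run; the function and argument names are illustrative.

```python
import math

def sgdr_lr(epoch, eta_min, eta_max, t_0, t_mult=1):
    """Learning rate at `epoch` under cosine annealing with warm restarts.

    t_0 is the length of the first run; every later run is t_mult times
    longer than the previous one.
    """
    t_i, t_cur = t_0, epoch
    while t_cur >= t_i:        # skip over the runs that are already finished
        t_cur -= t_i
        t_i *= t_mult
    cosine = math.cos(math.pi * t_cur / t_i)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + cosine)

# Runs of length 10, 20, 40, ... epochs: restarts at epochs 10, 30, 70, ...
for epoch in (0, 5, 9.5, 10, 20, 30):
    print(epoch, sgdr_lr(epoch, eta_min=0.0, eta_max=0.1, t_0=10, t_mult=2))
```

At every restart the rate jumps back to eta_max, and each run takes twice as long as the previous one to anneal down to eta_min.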

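SGDR in pytorch

PyTorch's CosineAnnealingLR implements only the cosine annealing part, not the restarts. (Newer PyTorch releases also provide a separate CosineAnnealingWarmRestarts scheduler.)

torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max, eta_min=0, last_epoch=-1)

It is used as follows:

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=1.)
steps = 10
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, steps)

for epoch in range(5):
    for idx in range(steps):
        optimizer.step()   # in real training this follows loss.backward()
        scheduler.step()
        print(scheduler.get_last_lr())  # get_lr() on older PyTorch versions

In practice, SGDR can be implemented like this:

base_lr = 1.
model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=base_lr)
steps = 10
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, steps)

for epoch in range(5):
    for idx in range(steps):
        optimizer.step()
        scheduler.step()
        print(scheduler.get_last_lr())

    print('Reset scheduler')
    # restore the starting lr before restarting, in case the new scheduler
    # starts from the optimizer's current (fully annealed) lr
    for group in optimizer.param_groups:
        group['lr'] = base_lr
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, steps)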

Reference

https://discuss.pytorch.org/t/how-to-implement-torch-optim-lr-scheduler-cosineannealinglr/28797/18

https://pytorch.org/docs/master/optim.html#torch.optim.lr_scheduler.CosineAnnealingLR

https://arxiv.org/pdf/1608.03983.pdf
