A Summary of Classic CNN Models

2018-09-04

I. Introduction

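　　Breakthroughs by CNN models in image processing have driven computer vision (CV) forward and set off the third wave of artificial neural networks (now usually called deep learning); the first two waves followed the perceptron in 1958 and the backpropagation algorithm in 1986. ImageNet in particular has pushed a steady stream of new CNN models, producing one classic architecture after another, and models pretrained on ImageNet have become the first choice for fine-tuning on small datasets. This post briefly walks through the common CNN architectures as a review of what I have learned and as a reference for later revision.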
　　

II. Classic CNNs

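　　CNNs date back to LeNet in the 1990s, which was used for handwritten character recognition and laid the foundation of modern convolutional neural networks. After about a decade of quiet, AlexNet won the classification track of ILSVRC (ImageNet Large Scale Visual Recognition Challenge) in 2012 with an error rate roughly 10 percentage points lower than traditional methods. CNN development has been unstoppable since then: from ZFNet to VGG and GoogLeNet, then ResNet and more recently DenseNet, networks have become deeper and architectures more complex, and the techniques for dealing with vanishing gradients, overfitting, and model degradation have become increasingly ingenious.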
　　
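　　Some useful CNN formulas, with input size $I$, kernel size $K$, input channels $C_{in}$, output channels $C_{out}$, stride $S$, and padding width $P$:
　　Feature map size: $\lfloor (I - K + 2P) / S \rfloor + 1$
　　Receptive field: the receptive field of the last layer on its own input equals its kernel size; moving backward one layer at a time, update $RF \leftarrow (RF - 1) \cdot S + K$ with that layer's $S$ and $K$, and iterate until the first layer is reached.
　　Parameters of one convolutional layer: $C_{in} \cdot K \cdot K \cdot C_{out} + C_{out}$ (the last term is the number of biases)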
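　　A minimal sketch of the three formulas above in plain Python. The function names are illustrative (not from any library), square inputs and kernels are assumed, and the receptive-field helper takes the layers as (kernel, stride) pairs ordered from first to last:

```python
def conv_output_size(i, k, s=1, p=0):
    """Feature map size: floor((I - K + 2P) / S) + 1."""
    return (i - k + 2 * p) // s + 1


def receptive_field(layers):
    """Receptive field of the last layer, measured on the network input.

    layers: list of (kernel_size, stride) pairs, first layer first.
    """
    rf = 1
    # Walk backward from the last layer, applying RF = (RF - 1) * S + K.
    for k, s in reversed(layers):
        rf = (rf - 1) * s + k
    return rf


def conv_params(c_in, k, c_out):
    """Weights of one conv layer plus one bias per output channel."""
    return c_in * k * k * c_out + c_out


print(conv_output_size(32, 5))            # 28  (LeNet C1)
print(conv_output_size(227, 11, s=4))     # 55  (AlexNet conv1)
print(receptive_field([(3, 1), (3, 1)]))  # 5   (two stacked 3x3 convolutions)
print(conv_params(1, 5, 6))               # 156 (LeNet C1)
```

　　Starting the loop from RF = 1 makes the first backward step return K for the last layer, which matches the rule that the last layer's receptive field is its own kernel size.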

1. Getting started: LeNet

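　　Origin: proposed by LeCun in 1998 for handwritten digit recognition (the MNIST dataset). MNIST is now the dataset every CV newcomer starts with, the "Hello world" of deep learning. Paper: "Gradient-Based Learning Applied to Document Recognition".
　　Contributions: it defined the basic building blocks of a CNN (convolution, activation function, pooling, fully connected layers) and is the ancestor of CNN models.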
　　
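　　LeNet-5: a convolutional neural network with 5 learnable layers (layers with trainable parameters: 2 conv + 3 FC).
　　Input layer: images normalized to 32 x 32 single-channel grayscale, i.e. a tensor of shape [32, 32, 1]. By convention the input layer is not counted as one of the network's layers.
　　Convolutional layer C1: input I: 32 x 32, kernel size K: 5 x 5, number of kernels O: 6, kernel tensor [1, 5, 5, 6], stride S = 1, padding VALID.
　　　　　　　Output feature map size: 32 - 5 + 1 = 28; number of neurons: 28 * 28 * 6
　　　　　　　Trainable parameters: (5 * 5 + 1) * 6 = 156 (each filter has 5 * 5 = 25 weights plus one bias, and there are 6 filters)
　　Pooling layer S2: input 28 x 28, pooling window 2 x 2, stride S = 2.
　　　　　　　Output size: 14 x 14
　　The remaining layers are computed in the same way.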
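　　To make "the remaining layers" concrete, here is a small sketch that traces shapes and parameter counts through the whole network. It assumes the simplified variant used in this post (C3 fully connected to all 6 input maps, pooling without trainable parameters); the 1998 paper uses a sparse connection table in C3 and trainable subsampling, so its counts differ. The names `LAYERS`, `FC` and `trace_lenet5` are illustrative.

```python
# (name, kind, kernel, stride, filters) for the convolutional part
LAYERS = [
    ("C1", "conv", 5, 1, 6),
    ("S2", "pool", 2, 2, None),
    ("C3", "conv", 5, 1, 16),
    ("S4", "pool", 2, 2, None),
]
# (name, units) for the fully connected part
FC = [("F5", 120), ("F6", 84), ("OUT", 10)]


def trace_lenet5(size=32, channels=1):
    """Return a list of (layer name, output shape, trainable parameters)."""
    rows = []
    for name, kind, k, s, filters in LAYERS:
        size = (size - k) // s + 1          # VALID padding
        params = 0
        if kind == "conv":
            params = (k * k * channels + 1) * filters
            channels = filters
        rows.append((name, (size, size, channels), params))
    units = size * size * channels           # flatten
    for name, n in FC:
        rows.append((name, (n,), (units + 1) * n))
        units = n
    return rows


for name, shape, params in trace_lenet5():
    print(name, shape, params)
print("total:", sum(p for _, _, p in trace_lenet5()))  # total: 61706
```

　　The flattened S4 output has 5 * 5 * 16 = 400 values, which is why the first fully connected layer alone accounts for (400 + 1) * 120 = 48120 of the parameters.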
　　Keras implementation (the activation function is replaced with the more modern ReLU):

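from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

def LeNet():
    model = Sequential()
    # C1: 6 filters of 5x5 on a 32x32x1 input -> 28x28x6
    model.add(Conv2D(filters=6, kernel_size=(5,5), padding='valid', input_shape=(32, 32, 1), activation='relu'))
    # S2: 2x2 pooling -> 14x14x6 (max pooling here instead of the paper's subsampling)
    model.add(MaxPooling2D(pool_size=(2,2)))
    # C3: 16 filters of 5x5 -> 10x10x16, then S4: 2x2 pooling -> 5x5x16
    model.add(Conv2D(filters=16, kernel_size=(5,5), padding='valid', activation='relu'))
    model.add(MaxPooling2D(pool_size=(2,2)))
    model.add(Flatten())
    model.add(Dense(120, activation='relu'))
    model.add(Dense(84, activation='relu'))
    model.add(Dense(10, activation='softmax'))
    return model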

2. The breakthrough: AlexNet

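　　Origin: AlexNet won the ImageNet competition in 2012 and opened a new chapter in deep learning and convolutional neural network research. Paper: "ImageNet Classification with Deep Convolutional Neural Networks".
　　Contributions: 1. a deeper network, 2. data augmentation, 3. the ReLU activation function, 4. Dropout, 5. LRN (later shown to be of little use and dropped by subsequent architectures), 6. parallel training on two GPUs.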
　　
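　　AlexNet has 8 layers in total, 5 convolutional layers + 3 fully connected layers, and the final softmax outputs 1000 classes.
　　1. With 8 layers AlexNet is deeper than LeNet-5, but the overall pipeline of the convolutional network is unchanged.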
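　　2. AlexNet classifies the 1000 classes of ImageNet. Input images are 256 x 256 x 3 three-channel color images. To improve generalization and reduce overfitting, the authors take random 224 x 224 x 3 crops from each 256 x 256 image and apply random horizontal flips, which enlarges the training set by a factor of 32 x 32 x 2 = 2048 before the images are fed to the network.
　　3. ReLU replaces Sigmoid to speed up the convergence of SGD. Because ReLU does not saturate for positive inputs, it also alleviates the vanishing-gradient problem.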
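　　4. Dropout works somewhat like the ensemble methods used with shallow learning algorithms. Each neuron in a fully connected layer (this model applies Dropout to the first two fully connected layers) is deactivated with a given probability (for example 0.5). A deactivated neuron takes no part in forward or backward propagation, so about half of the neurons are switched off: its output is set to 0 in the forward pass, and the error propagated back to it is also 0. In the next iteration all neurons are again randomly dropped according to keep_prob. The network topology therefore differs from iteration to iteration, which keeps the network from relying too heavily on a few neurons or features and forces the neurons to learn more robust features. At test time keep_prob is 1.0 for every layer, i.e. dropout is switched off, and every neuron's output is multiplied by 0.5, that is, $W_{test}^{(l)} = pW^{(l)}$. The purpose is to keep test-time activations on roughly the same scale as during training: if a neuron outputs x, it is kept with probability p and dropped with probability (1-p) during training, so its expected output is $px + (1-p) \cdot 0 = px$, and multiplying its weights by p at test time gives the same expectation. Introducing Dropout effectively reduces overfitting; in AlexNet it is applied only to the fully connected layers.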
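　　Forward pass without Dropout:
　　$z_i^{(l+1)} = w_i^{(l+1)} y^{(l)} + b_i^{(l+1)}, \quad y_i^{(l+1)} = f(z_i^{(l+1)})$
　　Forward pass with Dropout:
　　$r_j^{(l)} \sim \mathrm{Bernoulli}(p), \quad \tilde{y}^{(l)} = r^{(l)} \odot y^{(l)}, \quad z_i^{(l+1)} = w_i^{(l+1)} \tilde{y}^{(l)} + b_i^{(l+1)}, \quad y_i^{(l+1)} = f(z_i^{(l+1)})$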
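　　5. Hinton et al. argued that the LRN layer mimics lateral inhibition in biological neural systems, creating competition among the activities of neighboring neurons so that relatively large responses become relatively larger, which improves generalization. Later papers, however, such as "Very Deep Convolutional Networks for Large-Scale Image Recognition" (the paper that introduced VGG), showed that LRN brings no real benefit to CNNs while increasing computational cost, so the technique is no longer used.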
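　　The Dropout behavior described in item 4 can be sketched in a few lines of plain Python. This is only an illustration of the train/test asymmetry, assuming p is the keep probability; the function names are hypothetical, and real frameworks usually apply the rescaling during training instead ("inverted dropout"):

```python
import random


def dropout_train(activations, p_keep, rng):
    """Training: zero each activation independently with probability 1 - p_keep."""
    return [a if rng.random() < p_keep else 0.0 for a in activations]


def dropout_test(activations, p_keep):
    """Testing: keep every unit, but scale by p_keep to match the training expectation."""
    return [a * p_keep for a in activations]


def mean_train_output(x, p_keep, trials=20000, seed=0):
    """Average output of one unit over many random dropout masks."""
    rng = random.Random(seed)
    total = sum(dropout_train([x], p_keep, rng)[0] for _ in range(trials))
    return total / trials


print(dropout_test([2.0, 4.0], 0.5))   # [1.0, 2.0]
print(mean_train_output(2.0, 0.5))     # close to 1.0, i.e. p * x
```

　　Averaged over many masks, the training-time output of a unit approaches px, which is exactly what the test-time scaling reproduces in a single deterministic pass.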
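　　Input layer: [227, 227, 3] (the 224 x 224 x 3 shown in the paper's figure does not match the layer arithmetic)
　　Convolutional layer: kernel size K = 11, 96 kernels, written as [3, 11, 11, 96], stride S = 4, padding VALID; output size: (227 - 11) / 4 + 1 = 55, tensor shape [55, 55, 96]
　　Pooling layer: window size 3, stride 2 (overlapping pooling, which the paper reports slightly reduces overfitting); output: (55 - 3) / 2 + 1 = 27, tensor shape [27, 27, 96]
　　The remaining layers are computed in the same way.
　　Keras implementation: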

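from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

def AlexNet():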
    model = Sequential()  
    model.add(Conv2D(96,(11,11),strides=(4,4),input_shape=(227,227,3),padding='valid',activation='relu',kernel_initializer='uniform'))  
    model.add(MaxPooling2D(pool_size=(3,3),strides=(2,2)))  
    model.add(Conv2D(256,(5,5),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))  
    model.add(MaxPooling2D(pool_size=(3,3),strides=(2,2)))  
    model.add(Conv2D(384,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))  
    model.add(Conv2D(384,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))  
    model.add(Conv2D(256,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))  
    model.add(MaxPooling2D(pool_size=(3,3),strides=(2,2)))  
    model.add(Flatten())  
    model.add(Dense(4096,activation='relu'))  
    model.add(Dropout(0.5))  
    model.add(Dense(4096,activation='relu'))  
    model.add(Dropout(0.5))  
    model.add(Dense(1000,activation='softmax'))
    return model

3. Tuning: ZFNet

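　　Origin: the winning model of ImageNet 2013. It is essentially AlexNet with some details changed, without any major breakthrough in network structure. Paper: "Visualizing and Understanding Convolutional Networks".
　　Contributions: a novel visualization technique for "understanding" the intermediate feature layers and the final classifier layer, which was then used to find ways of improving the network structure.
　　　　　Conv1 changes from an 11 x 11 kernel with stride 4 to a 7 x 7 kernel with stride 2, giving a 110 x 110 feature map. Note that (224 - 7) / 2 + 1 = 109.5, so 110 is reached only with a little padding, e.g. $\lfloor (224 - 7 + 2 \cdot 1) / 2 \rfloor + 1 = 110$ with $P = 1$.
　　　　　Conv3, 4, 5 change from 384, 384, 256 filters to 512, 1024, 512.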
　　

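from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

# Note: conv3-5 below keep AlexNet's 384/384/256 filters rather than 512/1024/512.
def ZFNet():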
    model = Sequential()  
    model.add(Conv2D(96,(7,7),strides=(2,2),input_shape=(224,224,3),padding='valid',activation='relu',kernel_initializer='uniform'))  
    model.add(MaxPooling2D(pool_size=(3,3),strides=(2,2)))  
    model.add(Conv2D(256,(5,5),strides=(2,2),padding='same',activation='relu',kernel_initializer='uniform'))  
    model.add(MaxPooling2D(pool_size=(3,3),strides=(2,2)))  
    model.add(Conv2D(384,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))  
    model.add(Conv2D(384,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))  
    model.add(Conv2D(256,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))  
    model.add(MaxPooling2D(pool_size=(3,3),strides=(2,2)))  
    model.add(Flatten())  
    model.add(Dense(4096,activation='relu'))  
    model.add(Dropout(0.5))  
    model.add(Dense(4096,activation='relu'))  
    model.add(Dropout(0.5))  
    model.add(Dense(1000,activation='softmax'))
    return model

4. Going wider: GoogLeNet

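　　Origin: GoogLeNet beat VGG-Nets to win the ImageNet 2014 classification task. Besides simply making the network deeper (22 layers), it drew on the Network in Network idea and introduced the Inception module in place of the traditional plain convolution + activation, widening the network. Paper: "Going Deeper with Convolutions".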
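　　Contributions: the Inception module increases the width and depth of the network without increasing the computational budget.
　　　　　1x1 convolutions are used to add depth while also reducing dimensionality and limiting the size of the network.
　　　　　All filters in the Inception module are learned, unlike earlier approaches that handled multiple scales with a fixed set of Gabor filters.
　　　　　Following the Hebbian principle, the fully connected layers at the end are replaced with simple global average pooling, turning the fully connected network into a sparsely connected one.
　　　　　The architecture contains 3 loss units. To help the network converge, auxiliary loss units are attached to intermediate layers so that lower-level features also become discriminative. The two auxiliary losses are multiplied by 0.3 and added to the final loss to form the total loss used for training.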
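　　The weighting of the auxiliary losses amounts to a one-line combination. A minimal sketch, assuming the three per-batch loss values have already been computed (the function name and the example numbers are made up for illustration):

```python
def googlenet_total_loss(main_loss, aux_losses, aux_weight=0.3):
    """Total training loss: main loss plus the down-weighted auxiliary losses."""
    return main_loss + aux_weight * sum(aux_losses)


# Example with made-up loss values for the main head and the two auxiliary heads.
total = googlenet_total_loss(1.0, [2.0, 3.0])
print(round(total, 6))  # 2.5
```

　　The auxiliary heads only shape the gradients during training; at inference time they are discarded and only the main classifier is used.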
　　
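　　Inception module: all convolutions use stride 1 and zero padding so that the feature maps keep the same spatial size, and every convolutional layer is immediately followed by a ReLU.
　　　　1. Features are extracted from the input feature maps in four ways: 3 x 3 pooling plus convolutions at three scales, 1 x 1, 3 x 3, and 5 x 5.
　　　　2. To reduce computation, and to pass information through fewer connections for a sparser structure, 1 x 1 convolutions are used for dimensionality reduction (channel compression and fewer parameters), which also adds an extra ReLU nonlinearity.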
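　　How much the 1 x 1 reduction saves can be checked with simple arithmetic. The sketch below compares a direct 5 x 5 convolution with a 1 x 1 reduction followed by the 5 x 5 convolution; the channel counts (192 in, 16 reduced, 32 out) are chosen for illustration, biases are ignored, and the helper names are hypothetical:

```python
def conv_weights(c_in, k, c_out):
    """Number of weights in a k x k convolution (biases ignored)."""
    return c_in * k * k * c_out


def direct_vs_reduced(c_in, k, c_out, c_mid):
    """Weights of a direct k x k conv vs. a 1x1 reduction to c_mid followed by k x k."""
    direct = conv_weights(c_in, k, c_out)
    reduced = conv_weights(c_in, 1, c_mid) + conv_weights(c_mid, k, c_out)
    return direct, reduced


direct, reduced = direct_vs_reduced(c_in=192, k=5, c_out=32, c_mid=16)
print(direct, reduced)             # 153600 15872
print(round(direct / reduced, 1))  # 9.7
```

　　Because every weight is applied once per output position, the number of multiplications shrinks by the same factor as the weight count.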
　　
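　　Parameter table of the whole network: the total number of parameters is not large, but the number of operations is very high.
　　
　　Keras implementation of GoogLeNet: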

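from keras.models import Model
from keras.layers import (Input, Conv2D, BatchNormalization, MaxPooling2D,
                          AveragePooling2D, Dropout, Flatten, Dense, concatenate)

def Conv2d_BN(x, nb_filter,kernel_size, padding='same',strides=(1,1),name=None):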
    if name is not None:
        bn_name = name + '_bn'
        conv_name = name + '_conv'
    else:
        bn_name = None
        conv_name = None

    x = Conv2D(nb_filter,kernel_size,padding=padding,strides=strides,activation='relu',name=conv_name)(x)
    x = BatchNormalization(axis=3,name=bn_name)(x)
    return x

def Inception(x,nb_filter):
    branch1x1 = Conv2d_BN(x,nb_filter,(1,1), padding='same',strides=(1,1),name=None)

    branch3x3 = Conv2d_BN(x,nb_filter,(1,1), padding='same',strides=(1,1),name=None)
    branch3x3 = Conv2d_BN(branch3x3,nb_filter,(3,3), padding='same',strides=(1,1),name=None)

    # 1x1 reduction followed by the 5x5 convolution
    branch5x5 = Conv2d_BN(x,nb_filter,(1,1), padding='same',strides=(1,1),name=None)
    branch5x5 = Conv2d_BN(branch5x5,nb_filter,(5,5), padding='same',strides=(1,1),name=None)

    branchpool = MaxPooling2D(pool_size=(3,3),strides=(1,1),padding='same')(x)
    branchpool = Conv2d_BN(branchpool,nb_filter,(1,1),padding='same',strides=(1,1),name=None)

    x = concatenate([branch1x1,branch3x3,branch5x5,branchpool],axis=3)

    return x

def GoogLeNet():
    inpt = Input(shape=(224,224,3))
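    # padding='same' pads roughly (kernel_size - 1) / 2 on each side; an explicit ZeroPadding2D((3,3)) followed by a 'valid' convolution could be used instead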
    x = Conv2d_BN(inpt,64,(7,7),strides=(2,2),padding='same')
    x = MaxPooling2D(pool_size=(3,3),strides=(2,2),padding='same')(x)
    x = Conv2d_BN(x,192,(3,3),strides=(1,1),padding='same')
    x = MaxPooling2D(pool_size=(3,3),strides=(2,2),padding='same')(x)
    x = Inception(x,64)#256
    x = Inception(x,120)#480
    x = MaxPooling2D(pool_size=(3,3),strides=(2,2),padding='same')(x)
    x = Inception(x,128)#512
    x = Inception(x,128)
    x = Inception(x,128)
    x = Inception(x,132)#528
    x = Inception(x,208)#832
    x = MaxPooling2D(pool_size=(3,3),strides=(2,2),padding='same')(x)
    x = Inception(x,208)
    x = Inception(x,256)#1024
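    x = AveragePooling2D(pool_size=(7,7),strides=(7,7),padding='same')(x)
    x = Flatten()(x)  # (1, 1, C) -> (C,), so the classifier outputs one vector per image
    x = Dropout(0.4)(x)
    x = Dense(1000,activation='relu')(x)  # extra hidden layer of this listing; the paper uses a single linear layer + softmax
    x = Dense(1000,activation='softmax')(x)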
    model = Model(inpt,x,name='inception')
    return model

5. Stacking layers: VGGNet

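　　Origin: VGGNet was proposed by the Visual Geometry Group (VGG) at the University of Oxford and was the base network behind the first place in the localization task and second place in the classification task of ImageNet 2014. It can be seen as a deeper AlexNet. Paper: "Very Deep Convolutional Networks for Large-Scale Image Recognition".
　　Contributions: VGG uses a form of pre-training that is common in classic neural networks: a shallower network is trained first, and once it is stable, the network is deepened step by step on top of it. Table 1 of the paper shows this progression from left to right. Configuration D, which is among the best-performing, is VGG-16, and configuration E is VGG-19. The 16 in VGG-16 is the total number of conv + fc layers and does not include the max pooling layers.
　　　　　All convolutional layers share the same configuration with a smaller kernel and a smaller stride: 3 x 3 kernels, stride 1, padding 1. There are 5 max pooling layers, all 2 x 2 with stride 2, and three fully connected layers: the first two have 4096 channels each and the third has 1000, one per class label. The final layer is a softmax, and every hidden layer is followed by a ReLU nonlinearity.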
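　　Why 3 x 3 filters?
　　1. It is the smallest size that captures every direction (the 8-neighborhood).
　　2. Large receptive fields can be obtained by stacking 3 x 3 convolutional layers; for example, two stacked 3 x 3 layers give a 5 x 5 receptive field.
　　3. More layers also mean more ReLU layers, which strengthens the model's nonlinear fitting capacity and makes the decision function more discriminative.
　　4. Fewer parameters for the same receptive field. Three stacked 3 x 3 layers with C channels have $3 \cdot (3 \cdot 3 \cdot C \cdot C) = 27C^2$ parameters, while a single 7 x 7 layer with C channels has $7 \cdot 7 \cdot C \cdot C = 49C^2$.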
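　　Points 2 and 4 can be verified numerically. A short sketch, assuming stride-1 convolutions with C input and C output channels and ignoring biases (helper names are illustrative):

```python
def stacked_receptive_field(n, k=3):
    """Receptive field of n stacked k x k convolutions with stride 1."""
    rf = 1
    for _ in range(n):
        rf += k - 1   # each stride-1 layer widens the field by k - 1
    return rf


def stacked_weights(n, c, k=3):
    """Weights of n stacked k x k layers, each mapping c channels to c channels."""
    return n * k * k * c * c


def single_weights(k, c):
    """Weights of one k x k layer mapping c channels to c channels."""
    return k * k * c * c


print(stacked_receptive_field(2))  # 5
print(stacked_receptive_field(3))  # 7
print(stacked_weights(3, 64))      # 110592  (27 * 64^2)
print(single_weights(7, 64))       # 200704  (49 * 64^2)
```

　　For the same 7 x 7 receptive field the stacked version needs 27/49, or about 55%, of the weights, and it passes through three ReLUs instead of one.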
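　　[Figure: VGG-16 network architecture]
　　
　　VGG-16 has a very clean structure and is much deeper than AlexNet. It consists of repeated conv->conv->max_pool style blocks. All VGG convolutions are "same" convolutions, i.e. the output has the same spatial size as the input, and downsampling is done entirely by max pooling. The number of filters (output channels) starts at 64 and doubles after each pooling layer, to 128, 256, and 512, and the convolutional stack is followed by 3 fully connected layers. VGG's main contribution is the use of small filters and a regular pattern of convolution and pooling operations.
　　
　　Keras implementation of VGG-16: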

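from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

def VGG_16():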
    model = Sequential()
    
    model.add(Conv2D(64,(3,3),strides=(1,1),input_shape=(224,224,3),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(Conv2D(64,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(2,2)))
    
    model.add(Conv2D(128,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(Conv2D(128,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(2,2)))
    
    model.add(Conv2D(256,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(Conv2D(256,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(Conv2D(256,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(2,2)))
    
    model.add(Conv2D(512,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(Conv2D(512,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(Conv2D(512,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(2,2)))
    
    model.add(Conv2D(512,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(Conv2D(512,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(Conv2D(512,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(2,2)))
    
    model.add(Flatten())
    model.add(Dense(4096,activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(4096,activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1000,activation='softmax'))
    
    return model

6. Going deeper: ResNet

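　　Origin: in 2015 ResNet, by Kaiming He et al., swept ILSVRC and COCO and took first place. Rather than simply stacking more layers, ResNet made a major innovation in network structure; this new direction for convolutional networks is a milestone in the history of deep learning.
　　Contributions: very deep networks, exceeding a hundred layers.
　　　　　Residual units are introduced to address the degradation problem.
　　Two mappings: 1. the identity mapping, carried by the shortcut connection; 2. the residual mapping. The output is $y = F(x) + x$.
　　What if F(x) and x have different numbers of channels? When they match, $y = F(x) + x$; when they differ, $y = F(x) + Wx$, where W is a convolution that adjusts the channel dimension of x.
　　Residual blocks: the basic residual block (left in the original figure) consists of two 3 x 3 convolutional layers, but as networks get deeper this structure becomes less effective in practice. The "bottleneck residual block" (right in the figure) works better: it stacks 1 x 1, 3 x 3, and 1 x 1 convolutions in sequence, where the 1 x 1 convolutions reduce and then restore the channel dimension so that the 3 x 3 convolution operates on a lower-dimensional input, improving computational efficiency.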
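　　The shortcut rule and the cost argument for the bottleneck can be sketched without any framework. In this toy version a feature map is just a list of numbers, F and the projection W are plain functions supplied by the caller, and the channel counts (256 and 64) are example values; all names are illustrative:

```python
def residual_block(x, f, project=None):
    """y = F(x) + x, or y = F(x) + W x when a projection is supplied."""
    fx = f(x)
    shortcut = x if project is None else project(x)
    if len(fx) != len(shortcut):
        raise ValueError("F(x) and the shortcut must have the same length")
    return [a + b for a, b in zip(fx, shortcut)]


def conv_weights(c_in, k, c_out):
    """Number of weights in a k x k convolution (biases ignored)."""
    return c_in * k * k * c_out


def basic_block_weights(c):
    """Two 3x3 convolutions at c channels."""
    return 2 * conv_weights(c, 3, c)


def bottleneck_weights(c, c_mid):
    """1x1 reduce to c_mid, 3x3 at c_mid, 1x1 restore to c."""
    return (conv_weights(c, 1, c_mid)
            + conv_weights(c_mid, 3, c_mid)
            + conv_weights(c_mid, 1, c))


# If F learns to output zeros, the block reduces to the identity mapping.
print(residual_block([1.0, 2.0], lambda v: [0.0] * len(v)))  # [1.0, 2.0]
print(basic_block_weights(256))       # 1179648
print(bottleneck_weights(256, 64))    # 69632
```

　　The identity case is the intuition behind the fix for degradation: extra blocks can fall back to doing nothing, so a deeper network should not be harder to fit than a shallower one.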
　　
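　　Model architectures: VGG-19, the plain network, and ResNet (for the shortcut connections, solid lines mean the channel counts match and dashed lines mean they differ).
　　
　　ResNet50 and ResNet101: models commonly used in detection, segmentation, recognition, and related fields.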
　　
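　　ResNet101 has 101 learnable layers: 3 + 4 + 23 + 3 = 33 building blocks with 3 layers each give 33 x 3 = 99 layers; adding the first convolutional layer and the final fc layer (for classification) gives 1 + 99 + 1 = 101 layers.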
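　　The same counting rule reproduces the depth of the other ResNet variants. A small sketch; the helper name is illustrative, and the block configurations other than ResNet-101 are the standard ones from the ResNet paper:

```python
def resnet_depth(blocks_per_stage, layers_per_block=3):
    """First conv layer + all layers inside the building blocks + final fc layer."""
    return 1 + layers_per_block * sum(blocks_per_stage) + 1


print(resnet_depth([3, 4, 23, 3]))                     # 101
print(resnet_depth([3, 4, 6, 3]))                      # 50
print(resnet_depth([3, 8, 36, 3]))                     # 152
print(resnet_depth([3, 4, 6, 3], layers_per_block=2))  # 34 (basic blocks)
```

　　Projection shortcuts are not counted, which is why the 1 x 1 convolutions on the shortcut path do not appear in the total.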
　　[Figure: how ResNet101 is used in the RPN and Fast RCNN stages of Faster RCNN]
　　　
　　Keras implementation of ResNet-50:

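from keras.models import Model
from keras.layers import (Input, Conv2D, BatchNormalization, ZeroPadding2D,
                          MaxPooling2D, AveragePooling2D, Flatten, Dense, add)

def Conv2d_BN(x, nb_filter, kernel_size, strides=(1, 1), padding='same', name=None):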
    if name is not None:
        bn_name = name + '_bn'
        conv_name = name + '_conv'
    else:
        bn_name = None
        conv_name = None
    x = Conv2D(nb_filter, kernel_size, padding=padding, strides=strides, activation='relu', name=conv_name)(x)
    x = BatchNormalization(axis=3, name=bn_name)(x)
    return x
def Conv_Block(inpt, nb_filter, kernel_size, strides=(1, 1), with_conv_shortcut=False):
    x = Conv2d_BN(inpt, nb_filter=nb_filter[0], kernel_size=(1, 1), strides=strides, padding='same')
    x = Conv2d_BN(x, nb_filter=nb_filter[1], kernel_size=(3, 3), padding='same')
    x = Conv2d_BN(x, nb_filter=nb_filter[2], kernel_size=(1, 1), padding='same')
    if with_conv_shortcut:
        shortcut = Conv2d_BN(inpt, nb_filter=nb_filter[2], strides=strides, kernel_size=kernel_size)
        x = add([x, shortcut])
        return x
    else:
        x = add([x, inpt])
        return x
def ResNet50():
    inpt = Input(shape=(224, 224, 3))
    x = ZeroPadding2D((3, 3))(inpt)
    x = Conv2d_BN(x, nb_filter=64, kernel_size=(7, 7), strides=(2, 2), padding='valid')
    x = MaxPooling2D(pool_size=(3, 3), strides=(2, 2), padding='same')(x)
    x = Conv_Block(x, nb_filter=[64, 64, 256], kernel_size=(3, 3), strides=(1, 1), with_conv_shortcut=True)
    x = Conv_Block(x, nb_filter=[64, 64, 256], kernel_size=(3, 3))
    x = Conv_Block(x, nb_filter=[64, 64, 256], kernel_size=(3, 3))

    x = Conv_Block(x, nb_filter=[128, 128, 512], kernel_size=(3, 3), strides=(2, 2), with_conv_shortcut=True)
    x = Conv_Block(x, nb_filter=[128, 128, 512], kernel_size=(3, 3))
    x = Conv_Block(x, nb_filter=[128, 128, 512], kernel_size=(3, 3))
    x = Conv_Block(x, nb_filter=[128, 128, 512], kernel_size=(3, 3))

    x = Conv_Block(x, nb_filter=[256, 256, 1024], kernel_size=(3, 3), strides=(2, 2), with_conv_shortcut=True)
    x = Conv_Block(x, nb_filter=[256, 256, 1024], kernel_size=(3, 3))
    x = Conv_Block(x, nb_filter=[256, 256, 1024], kernel_size=(3, 3))
    x = Conv_Block(x, nb_filter=[256, 256, 1024], kernel_size=(3, 3))
    x = Conv_Block(x, nb_filter=[256, 256, 1024], kernel_size=(3, 3))
    x = Conv_Block(x, nb_filter=[256, 256, 1024], kernel_size=(3, 3))

    x = Conv_Block(x, nb_filter=[512, 512, 2048], kernel_size=(3, 3), strides=(2, 2), with_conv_shortcut=True)
    x = Conv_Block(x, nb_filter=[512, 512, 2048], kernel_size=(3, 3))
    x = Conv_Block(x, nb_filter=[512, 512, 2048], kernel_size=(3, 3))
    x = AveragePooling2D(pool_size=(7, 7))(x)
    x = Flatten()(x)
    x = Dense(1000, activation='softmax')(x)

    model = Model(inputs=inpt, outputs=x)
    return model

7. Dense connections: DenseNet

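　　Origin: a CVPR 2017 oral paper. The authors focus on features and exploit them to the fullest; DenseNet can be said to absorb the best part of ResNet and build further innovations on it, improving network performance further. Paper: "Densely Connected Convolutional Networks".
　　Contributions: alleviates vanishing gradients
　　　　　strengthens feature propagation
　　　　　makes more effective use of features
　　　　　reduces the number of parameters to some extent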
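　　DenseNet architecture:
　　
　　DenseNet consists of two main building blocks, the dense block and the transition layer.
　　Dense block: a traditional convolutional network with L layers has L connections, whereas DenseNet has L(L+1)/2, because the input of each layer comes from the outputs of all preceding layers. Each layer inside a dense block is close to the bottleneck used in ResNet: BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3). A DenseNet is made up of several such blocks, and the layers between consecutive dense blocks are called transition layers.
　　
　　A transition layer consists of BN -> Conv(1×1) -> average pooling (2×2).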
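　　The bookkeeping behind these two modules can be sketched as follows. The block sizes [6, 12, 24, 16], the growth rate 32 and the 64 initial filters are the DenseNet-121 values; the compression factor 0.5 is an assumption taken from the DenseNet-BC setting (with 1.0 the transition layers keep all channels). Function names are illustrative:

```python
def dense_connections(num_layers):
    """Number of direct connections in a dense block with L layers: L(L+1)/2."""
    return num_layers * (num_layers + 1) // 2


def densenet_channels(block_layers, growth_rate=32, init=64, compression=0.5):
    """Channel count after every dense block and transition layer."""
    ch = init
    trace = []
    for i, n in enumerate(block_layers):
        ch += n * growth_rate            # each layer appends growth_rate feature maps
        trace.append(("dense%d" % (i + 1), ch))
        if i < len(block_layers) - 1:    # no transition after the last block
            ch = int(ch * compression)
            trace.append(("transition%d" % (i + 1), ch))
    return trace


def densenet_depth(block_layers):
    """First conv + two convs per dense layer + one conv per transition + final fc."""
    return 1 + 2 * sum(block_layers) + (len(block_layers) - 1) + 1


print(dense_connections(5))                    # 15
print(densenet_channels([6, 12, 24, 16])[-1])  # ('dense4', 1024)
print(densenet_depth([6, 12, 24, 16]))         # 121
```

　　Because every layer only adds growth_rate new feature maps and reuses all earlier ones through concatenation, the individual layers stay narrow even though the block's output grows steadily.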
　　Keras implementation of DenseNet-121:

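# Note: this listing uses the Keras 1.x API (Convolution2D(n, k, k, subsample=..., bias=...),
# merge, K.image_dim_ordering). Scale is a custom per-channel scale/shift layer that is
# not part of Keras and has to be defined or imported separately.
from keras.models import Model
from keras.layers import (Input, merge, ZeroPadding2D, Dense, Dropout, Activation,
                          Convolution2D, MaxPooling2D, AveragePooling2D,
                          GlobalAveragePooling2D, BatchNormalization)
import keras.backend as K

def DenseNet121(nb_dense_block=4, growth_rate=32, nb_filter=64, reduction=0.0, dropout_rate=0.0, weight_decay=1e-4, classes=1000, weights_path=None):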
    '''Instantiate the DenseNet 121 architecture,
        # Arguments
            nb_dense_block: number of dense blocks in the network
            growth_rate: number of filters added by each conv_block (layer) inside a dense block
            nb_filter: initial number of filters
            reduction: reduction factor of transition blocks.
            dropout_rate: dropout rate
            weight_decay: weight decay factor
            classes: optional number of classes to classify images
            weights_path: path to pre-trained weights
        # Returns
            A Keras model instance.
    '''
    eps = 1.1e-5

    # compute compression factor
    compression = 1.0 - reduction

    # Handle Dimension Ordering for different backends
    global concat_axis
    if K.image_dim_ordering() == 'tf':
      concat_axis = 3
      img_input = Input(shape=(224, 224, 3), name='data')
    else:
      concat_axis = 1
      img_input = Input(shape=(3, 224, 224), name='data')

    # From architecture for ImageNet (Table 1 in the paper)
    nb_filter = 64
    nb_layers = [6,12,24,16] # For DenseNet-121

    # Initial convolution
    x = ZeroPadding2D((3, 3), name='conv1_zeropadding')(img_input)
    x = Convolution2D(nb_filter, 7, 7, subsample=(2, 2), name='conv1', bias=False)(x)
    x = BatchNormalization(epsilon=eps, axis=concat_axis, name='conv1_bn')(x)
    x = Scale(axis=concat_axis, name='conv1_scale')(x)
    x = Activation('relu', name='relu1')(x)
    x = ZeroPadding2D((1, 1), name='pool1_zeropadding')(x)
    x = MaxPooling2D((3, 3), strides=(2, 2), name='pool1')(x)

    # Add dense blocks
    for block_idx in range(nb_dense_block - 1):
        stage = block_idx+2
        x, nb_filter = dense_block(x, stage, nb_layers[block_idx], nb_filter, growth_rate, dropout_rate=dropout_rate, weight_decay=weight_decay)

        # Add transition_block
        x = transition_block(x, stage, nb_filter, compression=compression, dropout_rate=dropout_rate, weight_decay=weight_decay)
        nb_filter = int(nb_filter * compression)

    final_stage = stage + 1
    x, nb_filter = dense_block(x, final_stage, nb_layers[-1], nb_filter, growth_rate, dropout_rate=dropout_rate, weight_decay=weight_decay)

    x = BatchNormalization(epsilon=eps, axis=concat_axis, name='conv'+str(final_stage)+'_blk_bn')(x)
    x = Scale(axis=concat_axis, name='conv'+str(final_stage)+'_blk_scale')(x)
    x = Activation('relu', name='relu'+str(final_stage)+'_blk')(x)
    x = GlobalAveragePooling2D(name='pool'+str(final_stage))(x)

    x = Dense(classes, name='fc6')(x)
    x = Activation('softmax', name='prob')(x)

    model = Model(img_input, x, name='densenet')

    if weights_path is not None:
      model.load_weights(weights_path)

    return model


def conv_block(x, stage, branch, nb_filter, dropout_rate=None, weight_decay=1e-4):
    '''Apply BatchNorm, Relu, bottleneck 1x1 Conv2D, 3x3 Conv2D, and option dropout
        # Arguments
            x: input tensor 
            stage: index for dense block
            branch: layer index within each dense block
            nb_filter: number of filters
            dropout_rate: dropout rate
            weight_decay: weight decay factor
    '''
    eps = 1.1e-5
    conv_name_base = 'conv' + str(stage) + '_' + str(branch)
    relu_name_base = 'relu' + str(stage) + '_' + str(branch)

    # 1x1 Convolution (Bottleneck layer)
    inter_channel = nb_filter * 4  
    x = BatchNormalization(epsilon=eps, axis=concat_axis, name=conv_name_base+'_x1_bn')(x)
    x = Scale(axis=concat_axis, name=conv_name_base+'_x1_scale')(x)
    x = Activation('relu', name=relu_name_base+'_x1')(x)
    x = Convolution2D(inter_channel, 1, 1, name=conv_name_base+'_x1', bias=False)(x)

    if dropout_rate:
        x = Dropout(dropout_rate)(x)

    # 3x3 Convolution
    x = BatchNormalization(epsilon=eps, axis=concat_axis, name=conv_name_base+'_x2_bn')(x)
    x = Scale(axis=concat_axis, name=conv_name_base+'_x2_scale')(x)
    x = Activation('relu', name=relu_name_base+'_x2')(x)
    x = ZeroPadding2D((1, 1), name=conv_name_base+'_x2_zeropadding')(x)
    x = Convolution2D(nb_filter, 3, 3, name=conv_name_base+'_x2', bias=False)(x)

    if dropout_rate:
        x = Dropout(dropout_rate)(x)

    return x


def transition_block(x, stage, nb_filter, compression=1.0, dropout_rate=None, weight_decay=1E-4):
    ''' Apply BatchNorm, 1x1 Convolution, averagePooling, optional compression, dropout 
        # Arguments
            x: input tensor
            stage: index for dense block
            nb_filter: number of filters
            compression: calculated as 1 - reduction. Reduces the number of feature maps in the transition block.
            dropout_rate: dropout rate
            weight_decay: weight decay factor
    '''

    eps = 1.1e-5
    conv_name_base = 'conv' + str(stage) + '_blk'
    relu_name_base = 'relu' + str(stage) + '_blk'
    pool_name_base = 'pool' + str(stage) 

    x = BatchNormalization(epsilon=eps, axis=concat_axis, name=conv_name_base+'_bn')(x)
    x = Scale(axis=concat_axis, name=conv_name_base+'_scale')(x)
    x = Activation('relu', name=relu_name_base)(x)
    x = Convolution2D(int(nb_filter * compression), 1, 1, name=conv_name_base, bias=False)(x)

    if dropout_rate:
        x = Dropout(dropout_rate)(x)

    x = AveragePooling2D((2, 2), strides=(2, 2), name=pool_name_base)(x)

    return x


def dense_block(x, stage, nb_layers, nb_filter, growth_rate, dropout_rate=None, weight_decay=1e-4, grow_nb_filters=True):
    ''' Build a dense_block where the output of each conv_block is fed to subsequent ones
        # Arguments
            x: input tensor
            stage: index for dense block
            nb_layers: the number of layers of conv_block to append to the model.
            nb_filter: number of filters
            growth_rate: growth rate
            dropout_rate: dropout rate
            weight_decay: weight decay factor
            grow_nb_filters: flag to decide to allow number of filters to grow
    '''

    eps = 1.1e-5
    concat_feat = x

    for i in range(nb_layers):
        branch = i+1
        x = conv_block(concat_feat, stage, branch, growth_rate, dropout_rate, weight_decay)
        concat_feat = merge([concat_feat, x], mode='concat', concat_axis=concat_axis, name='concat_'+str(stage)+'_'+str(branch))

        if grow_nb_filters:
            nb_filter += growth_rate

    return concat_feat, nb_filter

III. References

CNN网络架构演进：从LeNet到DenseNet (The Evolution of CNN Architectures: From LeNet to DenseNet)