Under the Hood: Caffe

device_alternate.hpp

It differentiates the CPU-only mode and the mixed GPU/CPU mode with ifdef CPU_ONLY. In the CPU_ONLY branch, some functions are stubbed out, like

  • template <typename Dtype> void classname<Dtype>::Forward_gpu(const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top);

  • template <typename Dtype> void classname<Dtype>::Backward_gpu(const vector<Blob<Dtype>*>& top, const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);

  • template <typename Dtype> void classname<Dtype>::funcname##_##gpu(const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top);

  • template <typename Dtype> void classname<Dtype>::funcname##_##gpu(const vector<Blob<Dtype>*>& top, const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom).
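
These stubs come from macros. A condensed sketch of what STUB_GPU(classname) expands to (the header also provides STUB_GPU_FORWARD and STUB_GPU_BACKWARD variants for single functions):

#define STUB_GPU(classname) \
template <typename Dtype> \
void classname<Dtype>::Forward_gpu(const vector<Blob<Dtype>*>& bottom, \
    const vector<Blob<Dtype>*>& top) { NO_GPU; } \
template <typename Dtype> \
void classname<Dtype>::Backward_gpu(const vector<Blob<Dtype>*>& top, \
    const vector<bool>& propagate_down, \
    const vector<Blob<Dtype>*>& bottom) { NO_GPU; }

where NO_GPU expands to a LOG(FATAL) complaining that this is a CPU-only build.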

Personally, I don't like macro tricks like ##, which concatenates tokens. When macros and templates get tangled together, it's very hard to find out what went wrong.

If CPU_ONLY is not defined, we get the mixed mode. device_alternate.hpp then pulls in all the CUDA headers and defines some error-checking macros, like

  • CUDA_CHECK(condition),

  • CUBLAS_CHECK(condition),

  • CURAND_CHECK(condition).

Aside from these CHECK macros, it introduces a CUDA_KERNEL_LOOP macro:

#define CUDA_KERNEL_LOOP(i, n) \
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; \
       i < (n); \
       i += blockDim.x * gridDim.x)
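
Any kernel written with this macro becomes a grid-stride loop, so a fixed grid of threads can cover an arbitrary n. A minimal kernel sketch (the kernel itself is made up for illustration):

__global__ void add_scalar_kernel(const int n, const float alpha, float* y)
{
    CUDA_KERNEL_LOOP(index, n)
    {
        // each thread handles index, index + stride, index + 2*stride, ...
        y[index] += alpha;
    }
}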

At the end of this file, it defines a block-dispatch helper that determines how many blocks are needed to cover N threads:

inline int CAFFE_GET_BLOCKS(const int N)
{
    return (N + CAFFE_CUDA_NUM_THREADS - 1) / CAFFE_CUDA_NUM_THREADS;
}

CAFFE_CUDA_NUM_THREADS depends on __CUDA_ARCH__: if the value is no less than 200, the number of CUDA threads per block is 1024, otherwise 512.
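
Put together, a typical launch (reusing the hypothetical kernel above) looks like this; CUDA_POST_KERNEL_CHECK is another macro from this header that checks for launch errors:

add_scalar_kernel<<<CAFFE_GET_BLOCKS(n), CAFFE_CUDA_NUM_THREADS>>>(n, alpha, y);
CUDA_POST_KERNEL_CHECK;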

proto/caffe.pb.h

This file is generated by Google protobuf. It holds all the definitions of the data-stream structures and their get/set/serialize functions. At the first glance of this header file, my eyes burn. Maybe my cat is lost in it.

util/rng.hpp

This header defines two things: a function called caffe_rng, which returns the seeded RNG (a boost::mt19937), and a template shuffle function. Honestly, these two functions make little sense to me; why not just use C++11?
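
For comparison, the C++11 equivalent I'm alluding to is just a few lines (a sketch):

#include <algorithm>
#include <random>
#include <vector>

std::mt19937 rng(1701);                  // explicitly seeded engine
std::vector<int> v = {1, 2, 3, 4, 5};
std::shuffle(v.begin(), v.end(), rng);   // standard Fisher-Yates shuffle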

util/mkl_alternate.h

This header file works just like device_alternate.hpp: it differentiates MKL from other BLAS libraries. If we compile Caffe with MKL, there is nothing to do, because MKL already defines all the functions we need. Otherwise, we define the missing functions with macros. The macro works as below:

#define DEFINE_VSL_UNARY_FUNC(name, operation) \
  template<typename Dtype> \
  void v##name(const int n, const Dtype* a, Dtype* y) { \
    CHECK_GT(n, 0); CHECK(a); CHECK(y); \
    for (int i = 0; i < n; ++i) { operation; } \
  } \
  inline void vs##name( \
    const int n, const float* a, float* y) { \
    v##name<float>(n, a, y); \
  } \
  inline void vd##name( \
      const int n, const double* a, double* y) { \
    v##name<double>(n, a, y); \
  }

It introduces a template and then instantiates it for float and double, where the vs prefix stands for single-precision (float) vectors and vd for double-precision vectors. After the macro definition, this file predefines some simple unary functions:

DEFINE_VSL_UNARY_FUNC(Sqr, y[i] = a[i] * a[i]);
DEFINE_VSL_UNARY_FUNC(Exp, y[i] = exp(a[i]));
DEFINE_VSL_UNARY_FUNC(Abs, y[i] = fabs(a[i]));
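
After expansion, each line above yields the template plus two concrete wrappers, so hypothetical calls look like:

float a[3] = {1.0f, 2.0f, 3.0f}, y[3];
vsSqr(3, a, y);     // float version:  y = {1, 4, 9}

double b[2] = {0.0, 1.0}, z[2];
vdExp(2, b, z);     // double version: z = {1, e}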

Aside from the vector-transform template, this file also introduces a template that transforms a vector with an additional scalar parameter, and a template for binary functions:

#define DEFINE_VSL_UNARY_FUNC_WITH_PARAM(name, operation) \
  template<typename Dtype> \
  void v##name(const int n, const Dtype* a, const Dtype b, Dtype* y) { \
    CHECK_GT(n, 0); CHECK(a); CHECK(y); \
    for (int i = 0; i < n; ++i) { operation; } \
  } 
#define DEFINE_VSL_BINARY_FUNC(name, operation) \
  template<typename Dtype> \
  void v##name(const int n, const Dtype* a, const Dtype* b, Dtype* y) { \
    CHECK_GT(n, 0); CHECK(a); CHECK(b); CHECK(y); \
    for (int i = 0; i < n; ++i) { operation; } \
  }

And then it instantiates these templates with some simple operations, like:

DEFINE_VSL_BINARY_FUNC(Add, y[i] = a[i] + b[i]);
DEFINE_VSL_BINARY_FUNC(Sub, y[i] = a[i] - b[i]);
DEFINE_VSL_BINARY_FUNC(Mul, y[i] = a[i] * b[i]);
DEFINE_VSL_BINARY_FUNC(Div, y[i] = a[i] / b[i]);

In addition, MKL comes with an extra function, axpby, that is not present in standard BLAS. Caffe simply uses a two-step (inefficient, of course) way to mimic it:

inline void cblas_saxpby(const int N, const float alpha, const float* X,
        const int incX, const float beta, float* Y, const int incY)
{
    cblas_sscal(N, beta, Y, incY);
    cblas_saxpy(N, alpha, X, incX, Y, incY);
}
inline void cblas_daxpby(const int N, const double alpha, const double* X,
        const int incX, const double beta, double* Y, const int incY)
{
    cblas_dscal(N, beta, Y, incY);
    cblas_daxpy(N, alpha, X, incX, Y, incY);
}

where cblas_sscal and cblas_saxpy may be thought of as being defined like this:

void cblas_sscal(int N, float beta,float* Y, int incY)
{
    for(int i=0;i<N;i++)
    {
        Y[i*incY]*=beta;
    }
}
void cblas_saxpy(int N, float alpha, const float* X, int incX,float* Y,int incY)
{
    for(int i=0;i<N;i++)
    {
        Y[i*incY]=alpha*X[i*incX]+Y[i*incY];
    }
}

So axpby is essentially \(y=a*x+b*y\).

util/math_functions.hpp

In this header file, Caffe declares all the BLAS functions (the CPU version and the GPU version) that the CNN needs. Here are some interesting function names, without the device prefix:

  • gemm, which stands for general matrix matrix multiply,

  • gemv, which stands for general matrix vector multiply,

  • axpy, which stands for \(y=a*x+y\),

  • axpby, which stands for \(y=a*x+b*y\),

  • scal, which stands for \(y=a*y\),

  • asum, which stands for \(\sum{abs(y)}\),

  • stride_dot, which stands for \(\sum{y[i*incY]*x[i*incX]}\).

These functions call the respective cblas or cublas routines, so the declarations are just thin encapsulation. Aside from the BLAS functions, some random generators (gaussian and uniform) and simple element-wise functions (like div and sub) are declared. All the definition details are in math_functions.cu and math_functions.cpp.
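
The encapsulation is thin. For instance, the float specialization of caffe_axpy in math_functions.cpp is essentially a one-line forward to cblas (quoted from memory, so treat it as a sketch):

template<>
void caffe_axpy<float>(const int N, const float alpha, const float* X,
        float* Y)
{
    cblas_saxpy(N, alpha, X, 1, Y, 1);
}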

common.hpp

This header file defines some common macros and the Caffe class. The Caffe class is a singleton, accessed through the Get function:

inline static Caffe& Get()
{
    if (!singleton_.get())
    {
        singleton_.reset(new Caffe());
    }
    return *singleton_;
}

Personally I don't like this singleton implementation; we could just use a function-local static to achieve the same thing while keeping the shared_ptr away:

static Caffe& get()
{
    static Caffe instance;
    return instance;
}

There is also an RNG class declared inside Caffe to generate random numbers.

syncedmem.hpp

This header file wraps the CPU/GPU memory malloc/free function and the synchronization between GPU memory and CPU memory. The class which wraps the sync-memory is declared as such:

class SyncedMemory
{
    private:
        void to_cpu();
        void to_gpu();
        void* cpu_ptr_;
        void* gpu_ptr_;
        size_t size_;
        SyncedHead head_;
        bool own_cpu_data_;
        DISABLE_COPY_AND_ASSIGN(SyncedMemory);
};

For a sync-memory there are two blocks of memory: cpu_ptr_, allocated on the CPU, and gpu_ptr_, allocated on the GPU, both of size size_. The to_cpu and to_gpu functions do the synchronization work.

SyncedHead is an enum that indicates which block holds the up-to-date values. It is defined as such:

enum SyncedHead
{
    UNINITIALIZED, HEAD_AT_CPU, HEAD_AT_GPU, SYNCED
};

The DISABLE_COPY_AND_ASSIGN macro simply declares the copy constructor and the assignment operator private, though expressing this through a base class (or C++11's = delete) would be more comprehensible.
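
The macro itself (defined in common.hpp) is roughly:

#define DISABLE_COPY_AND_ASSIGN(classname) \
private: \
    classname(const classname&); \
    classname& operator=(const classname&)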

Caffe use SyncedMem class to synchronize values between the CPU and GPU in order to hide the synchronization details and to minimize data transfer. A rule of thumb is, always use the const call if you do not want to change the values, and never store the pointers in your own object. Every time you work on a blob, call the functions to get the pointers, as the SyncedMem will need this to figure out when to copy data.
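
The rule works because the accessors update head_. A hypothetical usage sketch:

SyncedMemory mem(16 * sizeof(float));
const void* r = mem.cpu_data();     // read-only view; syncs first if needed
void* w = mem.mutable_cpu_data();   // sets head_ = HEAD_AT_CPU
// the next gpu_data() call sees HEAD_AT_CPU and copies CPU -> GPU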

In practice when GPUs are present, one loads data from the disk to a blob in CPU code, calls a device kernel to do GPU computation, and ferries the blob off to the next layer, ignoring low-level details while maintaining a high level of performance. As long as all layers have GPU implementations, all the intermediate data and gradients will remain in the GPU.

util/db.hpp

This file simply serves as a leveldb wrapper; nothing much to talk about. But why does Caffe need leveldb at all?

util/io.hpp

io.hpp's main concern is file I/O: proto/Datum/cv::Mat serialization and deserialization, plus temp-file management. The implementations are trivial, so I just list the function names it declares.

  • inline void MakeTempFilename(string* temp_filename)

  • inline void MakeTempDir(string* temp_dirname)

  • inline void (Read/Write)Proto(From/to)(Text/Binary)File(...)

  • inline bool Read(File/Image)ToDatum

Currently I don't know what a Datum or a cv::Mat means; we will see.

internal_thread.hpp

Yet another library wrapper, just like the leveldb wrapper. This file wraps boost::thread in an InternalThread class. Why not just use C++11?

The Main Structure of Caffe

blob.hpp

This header file defines the Blob class. A Blob is a wrapper over the actual data being processed and passed along by Caffe, and under the hood it also provides synchronization between the CPU and the GPU. Mathematically, a blob is a 4-dimensional array that stores things in the order (Num, Channels, Height, Width), from major to minor, in a C-contiguous fashion. Two of these dimensions deserve a note (the name Num is due to legacy reasons; it is equivalent to the notion of "batch" as in minibatch SGD):

  • Number is the batch size of the data. Batch processing achieves better throughput for communication and device processing. For an ImageNet training batch of 256 images Number = 256.

  • Channel is the feature dimension e.g. for RGB images Channel = 3.

Caffe stores and communicates data in 4-dimensional arrays called blobs. Blobs provide a unified memory interface, holding data e.g. batches of images, model parameters, and derivatives for optimization.The conventional blob dimensions for data are number N x channel K x height H x width W. Blob memory is row-major in layout so the last / rightmost dimension changes fastest. For example, the value at index (n, k, h, w) is physically located at index ((n * K + k) * H + h) * W + w.
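
Blob exposes this arithmetic through its offset() member; a free-standing sketch of the same index computation:

// flat index of (n, k, h, w) in an N x K x H x W blob
inline int blob_offset(const int K, const int H, const int W,
        const int n, const int k, const int h, const int w)
{
    return ((n * K + k) * H + h) * W + w;
}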

A Blob stores two chunks of memory, data and diff. The former is the normal data that we pass along; the latter is the gradient computed by the network. The two chunks are declared as such:

std::shared_ptr<SyncedMemory> data_;
std::shared_ptr<SyncedMemory> diff_;

The memory is allocated in Reshape, as such:

template<typename Dtype>
void Blob<Dtype>::Reshape(const std::vector<int>& shape)
{
    CHECK_LE(shape.size(), kMaxBlobAxes);
    count_ = 1;
    shape_.resize(shape.size());
    for (int i = 0; i < shape.size(); ++i)
    {
        CHECK_GE(shape[i], 0);
        count_ *= shape[i];
        shape_[i] = shape[i];
    }
    if (count_ > capacity_)
    {
        capacity_ = count_;
        data_.reset(new SyncedMemory(capacity_ * sizeof(Dtype)));
        diff_.reset(new SyncedMemory(capacity_ * sizeof(Dtype)));
    }
}
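
Note that Reshape only reallocates when the new count exceeds the current capacity, so shrinking is free. A hypothetical usage:

Blob<float> blob;
blob.Reshape(std::vector<int>{2, 3, 4, 5});   // SyncedMemory for 120 values (the malloc itself is lazy)
blob.Reshape(std::vector<int>{1, 3, 4, 5});   // count_ <= capacity_: no reallocation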

In the blob.cpp implementation, the most important function is Update, which performs the gradient step data := data - diff. If the newest data is on the GPU side, Update calls caffe_gpu_axpy; otherwise it calls caffe_axpy. Here is the definition of Update:

template <typename Dtype>
void Blob<Dtype>::Update()
{
    // We will perform the update based on where the data is located.
    switch (data_->head())
    {
        case SyncedMemory::HEAD_AT_CPU:
            // perform computation on CPU
            caffe_axpy<Dtype>(count_, Dtype(-1),
                    static_cast<const Dtype*>(diff_->cpu_data()),
                    static_cast<Dtype*>(data_->mutable_cpu_data()));
            break;
        case SyncedMemory::HEAD_AT_GPU:
        case SyncedMemory::SYNCED:
#ifndef CPU_ONLY
            // perform computation on GPU
            caffe_gpu_axpy<Dtype>(count_, Dtype(-1),
                    static_cast<const Dtype*>(diff_->gpu_data()),
                    static_cast<Dtype*>(data_->mutable_gpu_data()));
#else
            NO_GPU;
#endif
            break;
        default:
            LOG(FATAL) << "Syncedmem not initialized.";
    }
}

filler.hpp

The filler header file does really simple work: it fills a blob with initial data. The data can come from:

  • a constant (0 by default),

  • a uniform distribution on \((a, b)\),

  • a gaussian distribution with mean and standard deviation \((\mu, \sigma)\),

  • a positive unit ball distribution,

  • the xavier scheme.

There is a factory function to generate the specific filler class from its protobuf parameter:

template<typename Dtype>
Filler<Dtype>* GetFiller(const FillerParameter& param)
{
    const std::string& type = param.type();
    if (type == "constant")
    {
        return new ConstantFiller<Dtype>(param);
    }
    else if (type == "gaussian")
    {
        return new GaussianFiller<Dtype>(param);
    }
    else if (type == "positive_unitball")
    {
        return new PositiveUnitballFiller<Dtype>(param);
    }
    else if (type == "uniform")
    {
        return new UniformFiller<Dtype>(param);
    }
    else if (type == "xavier")
    {
        return new XavierFiller<Dtype>(param);
    }
    else
    {
        CHECK(false) << "Unknown filler name: " << param.type();
    }
    return (Filler<Dtype>*) (NULL);
}
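
A hypothetical use of the factory:

FillerParameter param;
param.set_type("xavier");
shared_ptr<Filler<float> > filler(GetFiller<float>(param));
filler->Fill(&blob);   // fills blob's data_ according to the xavier scheme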

layer.hpp

The layer is the essence of a model and the fundamental unit of computation. Layers convolve filters, pool, take inner products, apply nonlinearities like rectified-linear and sigmoid and other elementwise transformations, normalize, load data, and compute losses like softmax and hinge. A layer takes input through bottom Blobs and produces output through top Blobs.

Each layer type defines three critical computations: setup, forward, and backward.

  • Setup: initialize the layer and its connections once at model initialization.

  • Forward: given input from bottom compute the output and send to the top.

  • Backward: given the gradient w.r.t. the top output compute the gradient w.r.t. to the input and send to the bottom. A layer with parameters computes the gradient w.r.t. to its parameters and stores it internally.

More specifically, there will be two Forward and Backward functions implemented, one for CPU and one for GPU. If you do not implement a GPU version, the layer will fall back to the CPU functions as a backup option. This may come handy if you would like to do quick experiments, although it may come with additional data transfer cost (its inputs will be copied from GPU to CPU, and its outputs will be copied back from CPU to GPU).
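
The fallback works because the inline wrapper Forward dispatches on the global mode, and the default Forward_gpu implementation simply calls Forward_cpu. A condensed sketch of the dispatch (loss accumulation omitted):

switch (Caffe::mode())
{
    case Caffe::CPU:
        Forward_cpu(bottom, top);
        break;
    case Caffe::GPU:
        Forward_gpu(bottom, top);   // falls back to Forward_cpu if not overridden
        break;
    default:
        LOG(FATAL) << "Unknown caffe mode.";
}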

Layers have two key responsibilities for the operation of the network as a whole: a forward pass that takes the inputs and produces the outputs, and a backward pass that takes the gradient with respect to the output, and computes the gradients with respect to the parameters and to the inputs, which are in turn back-propagated to earlier layers. These passes are simply the composition of each layer’s forward and backward.

Layers are constructed from the prototxt serialization format. The layer constructor takes a LayerParameter and initializes the layer:

explicit Layer(const LayerParameter& param) :
        layer_param_(param)
{
    // Set phase and copy blobs (if there are any).
    phase_ = param.phase();
    if (layer_param_.blobs_size() > 0)
    {
        blobs_.resize(layer_param_.blobs_size());
        for (int i = 0; i < layer_param_.blobs_size(); ++i)
        {
            blobs_[i].reset(new Blob<Dtype>());
            blobs_[i]->FromProto(layer_param_.blobs(i));
        }
    }
}

But this so-called constructor doesn't really construct the connections; it just does the blob-allocation job. To set up the connections, we should call the virtual LayerSetUp function, whose implementation depends on the specific type of layer. The Layer class offers an entry point that wraps LayerSetUp:

void SetUp(const vector<Blob<Dtype>*>& bottom,
            const vector<Blob<Dtype>*>& top)
{
    CheckBlobCounts(bottom, top);  // check the numbers of bottom/top blobs match what the layer expects
    LayerSetUp(bottom, top);
    Reshape(bottom, top);
    SetLossWeights(top);
}

Reshape is a virtual function as well; it adjusts the shapes of the top blobs and internal buffers to accommodate the shapes of the bottom blobs.

There are five kinds of layer:

  • data layer: constructs the input layer from different serializations of data, like hdf5, leveldb, lmdb etc.

  • neural layer: takes exactly one input blob and produces one output blob; a simple transformation layer such as sigmoid, tanh etc.

  • common layer: here go the more complicated transformations, like ArgMaxLayer, ConcatLayer, FlattenLayer, SoftmaxLayer, SplitLayer etc.

  • loss layer: computes the loss and the initial gradient for back-propagation, e.g. sigmoid cross-entropy loss, square loss, contrastive loss, hinge loss etc.

  • vision layer: implements image/vision-specific transformations, like convolution.

All the layer details can be seen in the src/caffe/layers source files.

net.hpp

The net jointly defines a function and its gradient by composition and auto-differentiation. The composition of every layer’s output computes the function to do a given task, and the composition of every layer’s backward computes the gradient from the loss to learn the task. Caffe models are end-to-end machine learning engines.

The net is a set of layers connected in a computation graph – a directed acyclic graph (DAG) to be exact. Caffe does all the bookkeeping for any DAG of layers to ensure correctness of the forward and backward passes. A typical net begins with a data layer that loads from disk and ends with a loss layer that computes the objective for a task such as classification or reconstruction.

The net is defined as a set of layers and their connections in a plaintext modeling language. Models are defined in the plaintext protocol buffer schema (prototxt), while learned models are serialized as binary protocol buffer (binaryproto) .caffemodel files.

Model initialization is handled by Net::Init(). The initialization mainly does two things: scaffolding the overall DAG by creating the blobs and layers (all resources held by shared_ptr), and calling each layer's SetUp() function. It also does a set of other bookkeeping things, such as validating the correctness of the overall network architecture. During initialization, the Net explains its initialization by logging to INFO with the help of glog.

In the net.hpp file, a set of train/test functions are defined:

Dtype ForwardFromTo(int start, int end);
Dtype ForwardFrom(int start);
Dtype ForwardTo(int end);
const vector<Blob<Dtype>*>& Forward(const vector<Blob<Dtype>*>& bottom,
        Dtype* loss = NULL);
void Backward();
void BackwardFromTo(int start, int end);
void BackwardFrom(int start);
void BackwardTo(int end);
Dtype ForwardBackward(const vector<Blob<Dtype>*>& bottom)
{
    Dtype loss;
    Forward(bottom, &loss);
    Backward();
    return loss;
}
void Update();

solver.hpp

The solver orchestrates model optimization by coordinating the network’s forward inference and backward gradients to form parameter updates that attempt to improve the loss. The responsibilities of learning are divided between the Solver for overseeing the optimization and generating parameter updates and the Net for yielding loss and gradients.

The Caffe solvers are Stochastic Gradient Descent (SGD), Adaptive Gradient (ADAGRAD), and Nesterov’s Accelerated Gradient (NESTEROV). As usual, there is a factory function to generate every kind of solver:

template<typename Dtype>
Solver<Dtype>* GetSolver(const caffe::SolverParameter& param)
{
    caffe::SolverParameter_SolverType type = param.solver_type();

    switch (type)
    {
        case SolverParameter_SolverType_SGD:
            return new SGDSolver<Dtype>(param);
        case SolverParameter_SolverType_NESTEROV:
            return new NesterovSolver<Dtype>(param);
        case SolverParameter_SolverType_ADAGRAD:
            return new AdaGradSolver<Dtype>(param);
        default:
            LOG(FATAL) << "Unknown SolverType: " << type;
    }
    return (Solver<Dtype>*) NULL;
}

The solver:

  • scaffolds the optimization bookkeeping and creates the training network for learning and test network(s) for evaluation.

  • iteratively optimizes by calling forward / backward and updating parameters

  • (periodically) evaluates the test networks

  • snapshots the model and solver state throughout the optimization

During each iteration:

  • calls network forward to compute the output and loss

  • calls network backward to compute the gradients

  • incorporates the gradients into parameter updates according to the solver method

  • updates the solver state according to learning rate, history, and method

to take the weights all the way from initialization to learned model.

Beneath the forward/backward/update functions, what the solver really does is solve an optimization problem of loss minimization. We can state the problem as minimizing

\begin{equation}L(W)=\frac{1}{\vert D \vert}\sum_{i}^{\vert D\vert}{f_W (X^{i})} +\lambda r(W)\end{equation}

over dataset \(D\), where \(f_W(X^{i})\) is the loss on data instance \(X^{i}\) and \(r(W)\) is a regularization term with weight \(\lambda\).

The general optimization method is gradient descent. But \(D\) in machine learning can be so large that computing the full gradient is impractical. In practice, we use a stochastic approximation of the objective, drawing a mini-batch of \(N \ll \vert D\vert\) instances:

\begin{equation}L(W)\approx \frac{1}{N}\sum_{i}^{N}{f_W (X^{i})} +\lambda r(W)\end{equation}

The model computes \(f_W\) in the forward pass and the gradient \(\nabla f_W\) in the backward pass. The parameter update \(\Delta W\) is formed by the solver from the error gradient \(\nabla f_W\), the regularization gradient \(\nabla r(W)\), and other particulars of each method.

Stochastic gradient descent (solver_type: SGD) updates the weights \(W\) by a linear combination of the negative gradient \(\nabla L(W)\) and the previous weight update \(V_t\). The learning rate \(\alpha\) is the weight of the negative gradient. The momentum \(\mu\) is the weight of the previous update.

Formally, we have the following formulas to compute the update value \(V_{t+1}\) and the updated weights \(W_{t+1}\) at iteration \(t+1\), given the previous weight update \(V_t\) and current weights \(W_t\):

\begin{equation}\begin{aligned}V_{t+1}&=\mu V_t -\alpha \nabla L(W_t) \\W_{t+1}&=W_t+V_{t+1}\end{aligned}\end{equation}

The learning “hyperparameters” (\(\alpha\) and \(\mu\)) might require a bit of tuning for best results.
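
In code, this update is one axpby per parameter blob. A sketch in the spirit of SGDSolver's ComputeUpdateValue (CPU branch only; treat it as an approximation). Note that Caffe keeps the sign flipped relative to the formula above: it accumulates \(+\alpha \nabla L(W_t)\) in the history and lets the update step subtract.

// history_[i] holds V_t for the i-th parameter blob
// V := momentum * V + local_rate * gradient
caffe_cpu_axpby(net_params[i]->count(), local_rate,
        net_params[i]->cpu_diff(), momentum,
        history_[i]->mutable_cpu_data());
// copy V into the blob's diff; Net::Update() then performs W := W - diff
caffe_copy(net_params[i]->count(), history_[i]->cpu_data(),
        net_params[i]->mutable_cpu_diff());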

The adaptive gradient (solver_type: ADAGRAD) method is a gradient-based optimization method (like SGD) that attempts to “find needles in haystacks in the form of very predictive but rarely seen features,” in Duchi et al.’s words. Given the update information from all previous iterations \((\nabla L(W))_{t'}\) for \({t'}\in {1,2,...,t}\), the update formulas are as follows, specified for each component \(i\) of the weights \(W\):

\begin{equation}(W_{t+1})_i=(W_t)_i-\alpha \frac{(\nabla L(W_t))_i}{\sqrt{\sum_{t'=1}^{t}{(\nabla L(W_{t'}))^2_i}}}\end{equation}

Note that in practice, for weights \(W\in \mathcal{R}^d\), AdaGrad implementations (including the one in Caffe) use only \(\mathcal{O}(d)\) extra storage for the historical gradient information (rather than the \(\mathcal{O}(dt)\) storage that would be necessary to store each historical gradient individually).
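
Schematically, the \(\mathcal{O}(d)\) bookkeeping is just a running sum of squared gradients. A conceptual step function (not Caffe's actual implementation, which also adds a small delta inside the square root for numerical stability):

#include <cmath>

// one AdaGrad step over d weight components; hist accumulates squared gradients
void adagrad_step(int d, float alpha, const float* grad, float* hist, float* w)
{
    for (int i = 0; i < d; ++i)
    {
        hist[i] += grad[i] * grad[i];
        w[i] -= alpha * grad[i] / std::sqrt(hist[i]);
    }
}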

Nesterov’s accelerated gradient (solver_type: NESTEROV) was proposed by Nesterov as an “optimal” method of convex optimization, achieving a convergence rate of \(\mathcal{O}(1/t^2)\) rather than \(\mathcal{O}(1/t)\). Though the required assumptions to achieve the \(\mathcal{O}(1/t^2)\) convergence typically will not hold for deep networks trained with Caffe (e.g., due to non-smoothness and non-convexity), in practice NAG can be a very effective method for optimizing certain types of deep learning architectures, as demonstrated for deep MNIST autoencoders by Sutskever et al.

The weight update formulas look very similar to the SGD updates given above:

\begin{equation}\begin{aligned}V_{t+1}&=\mu V_t -\alpha \nabla L(W_t,\mu V_t)\\W_{t+1}&=W_t+V_{t+1}\end{aligned}\end{equation}

What distinguishes the method from SGD is the weight setting \(W\) on which we compute the error gradient \(\nabla L(W)\) – in NAG we take the gradient on weights with added momentum \(\nabla L(W_t+\mu V_t)\); in SGD we simply take the gradient \(\nabla L(W_t)\) on the current weights themselves.

The workflow of caffe

network initialization

First, we construct the whole network structure (wrapped in a solver) from a prototxt file:

Solver<Dtype>::Solver(const string& param_file) :
        net_()
{
    SolverParameter param;
    ReadProtoFromTextFileOrDie(param_file, &param);
    Init(param);
}

ReadProtoFromTextFileOrDie is a protobuf wrapper; it eventually executes

bool success = google::protobuf::TextFormat::Parse(input, proto);

So don't bother diving into the mess of protobuf.

After we have all the parameters, we construct the nets, as such:

void Solver<Dtype>::Init(const SolverParameter& param)
{
    LOG(INFO)<< "Initializing solver from parameters: " << std::endl
    << param.DebugString();
    param_ = param;
    CHECK_GE(param_.average_loss(), 1) << "average_loss should be non-negative.";
    if (param_.random_seed() >= 0)
    {
        Caffe::set_random_seed(param_.random_seed());
    }
    // Scaffolding code
    InitTrainNet();
    InitTestNets();
    LOG(INFO) << "Solver scaffolding done.";
    iter_ = 0;
    current_step_ = 0;
}

In Init, it calls two initialization functions, InitTrainNet and InitTestNets. Because a net can be created either from inline network parameters or from a separate network definition file, the Init*Net functions must differentiate these cases. Here is the InitTrainNet definition:

void Solver<Dtype>::InitTrainNet()
{
    const int num_train_nets = param_.has_net() + param_.has_net_param()
            + param_.has_train_net() + param_.has_train_net_param();
    const string& field_names = "net, net_param, train_net, train_net_param";
    CHECK_GE(num_train_nets, 1)<< "SolverParameter must specify a train net "
    << "using one of these fields: " << field_names;
    CHECK_LE(num_train_nets, 1)<< "SolverParameter must not contain more than "
    << "one of these fields specifying a train_net: " << field_names;
    NetParameter net_param;
    if (param_.has_train_net_param())
    {
        LOG(INFO)<< "Creating training net specified in train_net_param.";
        net_param.CopyFrom(param_.train_net_param());
    }
    else if (param_.has_train_net())
    {
        LOG(INFO) << "Creating training net from train_net file: "
        << param_.train_net();
        ReadNetParamsFromTextFileOrDie(param_.train_net(), &net_param);
    }
    if (param_.has_net_param())
    {
        LOG(INFO)<< "Creating training net specified in net_param.";
        net_param.CopyFrom(param_.net_param());
    }
    if (param_.has_net())
    {
        LOG(INFO)<< "Creating training net from net file: " << param_.net();
        ReadNetParamsFromTextFileOrDie(param_.net(), &net_param);
    }
        // Set the correct NetState.  We start with the solver defaults (lowest
        // precedence); then, merge in any NetState specified by the net_param itself;
        // finally, merge in any NetState specified by the train_state (highest
        // precedence).
    NetState net_state;
    net_state.set_phase(TRAIN);
    net_state.MergeFrom(net_param.state());
    net_state.MergeFrom(param_.train_state());
    net_param.mutable_state()->CopyFrom(net_state);
    net_.reset(new Net<Dtype>(net_param));
}

In the definition body, almost all the work is done by protobuf.

network training

Network training is an open-ended procedure: left alone, the training phase would run forever, so we have to choose the right moment to stop. Usually there is one simple principle: we set a stop_iter to bound the number of iterations we can tolerate. So the Solve function, which is the Caffe execution body, sets the iteration limit and then calls the Step function to do the real work.

template<typename Dtype>
void Solver<Dtype>::Step(int iters)
{
    vector<Blob<Dtype>*> bottom_vec;
    const int start_iter = iter_;
    const int stop_iter = iter_ + iters;
    int average_loss = this->param_.average_loss();
    vector<Dtype> losses;
    Dtype smoothed_loss = 0;

    for (; iter_ < stop_iter; ++iter_)
    {
        if (param_.test_interval() && iter_ % param_.test_interval() == 0
                && (iter_ > 0 || param_.test_initialization()))
        {
            TestAll();
        }

        const bool display = param_.display()
                && iter_ % param_.display() == 0;
        net_->set_debug_info(display && param_.debug_info());
        Dtype loss = net_->ForwardBackward(bottom_vec);
        if (losses.size() < average_loss)
        {
            losses.push_back(loss);
            int size = losses.size();
            smoothed_loss = (smoothed_loss * (size - 1) + loss) / size;
        }
        else
        {
            int idx = (iter_ - start_iter) % average_loss;
            smoothed_loss += (loss - losses[idx]) / average_loss;
            losses[idx] = loss;
        }
        if (display)
        {
            // some code to display information about the current iteration, like the loss
        }
        ComputeUpdateValue();
        net_->Update();

        // Save a snapshot if needed.
        if (param_.snapshot() && (iter_ + 1) % param_.snapshot() == 0)
        {
            Snapshot();
        }
    }
}

In every iteration, Caffe:

  • runs a test over all the test nets every test_interval iterations and outputs some test info,

  • calls net_->ForwardBackward to train on one batch, which eventually calls each layer's Forward and Backward in a chain,

  • calls ComputeUpdateValue to set all the diffs of the blobs; this function is virtual and its implementation depends on the specific Solver,

  • calls net_->Update to update the weights.
