Instead of extracting descriptors from cropped image patches, Dense-ContextDesc takes the full image as input, densely extracts a feature map, and uses a Spatial Transformer Network to crop a set of "feature patches" around each keypoint.
For example, using the scale and orientation parameters obtained from the SIFT detector, we compose an affine transformation matrix and crop a "feature patch" of size 8x8x128 around each keypoint location. The "feature patch" is then mapped to a 1x1x128 descriptor by a convolution with an 8x8 filter. This process is similar to LF-Net, except that scale and orientation are provided by the detector rather than predicted, as sketched below.
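The following is a minimal PyTorch sketch of this cropping step, not the reference implementation: it assumes the keypoint location and SIFT scale are already expressed in feature-map pixel coordinates, and the function name `crop_feature_patch`, the `to_descriptor` layer, and the use of `F.affine_grid`/`F.grid_sample` as the Spatial Transformer sampler are illustrative choices.

```python
import math
import torch
import torch.nn.functional as F


def crop_feature_patch(feat_map, kpt_xy, scale, orientation, patch_size=8):
    """Bilinearly sample a patch_size x patch_size "feature patch" around one
    keypoint from a dense feature map (Spatial Transformer-style crop).

    feat_map:    (1, C, H, W) dense feature map
    kpt_xy:      (x, y) keypoint location, in feature-map pixels (assumed)
    scale:       SIFT scale, i.e. half-extent of the crop in feature-map pixels
    orientation: SIFT orientation, in radians
    """
    _, C, H, W = feat_map.shape
    x, y = kpt_xy
    cos, sin = math.cos(orientation), math.sin(orientation)

    # Affine matrix in the normalized [-1, 1] coordinates expected by
    # F.affine_grid: isotropic scaling + rotation + translation to the keypoint.
    sx, sy = 2.0 * scale / W, 2.0 * scale / H
    tx, ty = 2.0 * x / W - 1.0, 2.0 * y / H - 1.0
    theta = torch.tensor([[sx * cos, -sx * sin, tx],
                          [sy * sin,  sy * cos, ty]], dtype=feat_map.dtype)

    grid = F.affine_grid(theta.unsqueeze(0),
                         size=(1, C, patch_size, patch_size),
                         align_corners=False)
    return F.grid_sample(feat_map, grid, mode='bilinear', align_corners=False)


# A single 8x8 convolution (no padding) collapses the 8x8x128 feature patch
# into a 1x1x128 output, which is flattened and L2-normalized into the
# final descriptor. The layer here is randomly initialized for illustration.
to_descriptor = torch.nn.Conv2d(128, 128, kernel_size=8)

feat_map = torch.randn(1, 128, 60, 80)  # e.g. a 128-channel dense feature map
patch = crop_feature_patch(feat_map, kpt_xy=(40.0, 30.0),
                           scale=4.0, orientation=0.3)          # (1, 128, 8, 8)
desc = F.normalize(to_descriptor(patch).flatten(1), dim=1)       # (1, 128)
```

Because the crop is expressed as a differentiable affine warp plus bilinear sampling, gradients flow from the descriptor loss back into the dense feature extractor, which is what allows training from full images rather than pre-cut patches.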