rs-face-dlib Sample

Overview

This example demonstrates a very simple facial landmark detection using Dlib's machine learning algorithms, using depth data to implement basic anti-spoofing.

Note: This is just an example intended to showcase possible applications. The heuristics used here are very simple and basic, and can be greatly improved on.

Faces detected by the camera will show landmarks in either green or red:

Green landmarks indicate a "real" face with depth data corroborating expectations. E.g., the distance to the eyes should be greater than to the tip of the nose, etc.
Red landmarks indicate a "fake" face. E.g., a picture of a face, where the distance to each facial feature does not meet expectations

Note: faces should be forward-facing to be detectable

Implementation

To enable usage of librealsense frame data as a dlib image, a rs_frame_image class is introduced. No copying of frame data takes place.

cpp

rs_frame_image< dlib::rgb_pixel, RS2_FORMAT_RGB8 > image( color_frame );

Faces are detected in two steps:

Facial boundary rectangles are detected:

cpp

  dlib::frontal_face_detector face_bbox_detector = dlib::get_frontal_face_detector();
  ...
  std::vector< dlib::rectangle > face_bboxes = face_bbox_detector( image );

Each one is then annotated to find its landmarks:

cpp

  dlib::shape_predictor face_landmark_annotator;
  dlib::deserialize( "shape_predictor_68_face_landmarks.dat" ) >> face_landmark_annotator;
  ...
  std::vector< dlib::full_object_detection > faces;
  for( auto const & bbox : face_bboxes )
    faces.push_back( face_landmark_annotator( image, bbox ));

Note: A dataset (a trained model file) is required to run this sample. You can use a dataset of your choosing/choosing or download the 68-point trained model file from dlib's site.

The landmarks calculated are 68-point 2D coordinates in the color frame. See this picture for (1-based) indexes for each of the points.

Once available, landmarks are used to get average eye, nose, ear, mouth, and chin depths. The distances are compared with expected relationships:

cpp

if( nose_depth >= eye_depth )
    return false;
if( eye_depth - nose_depth > 0.07f )
    return false;
if( ear_depth <= eye_depth )
    return false;
if( mouth_depth <= nose_depth )
    return false;
if( mouth_depth > chin_depth )
    return false;

// All the distances, collectively, should not span a range that makes no sense. I.e.,
// if the face accounts for more than 20cm of depth, or less than 2cm, then something's
// not kosher!
float x = std::max( { nose_depth, eye_depth, ear_depth, mouth_depth, chin_depth } );
float n = std::min( { nose_depth, eye_depth, ear_depth, mouth_depth, chin_depth } );
if( x - n > 0.20f )
    return false;
if( x - n < 0.02f )
    return false;

Because depth data is used to correspond with landmark pixels from the color image, it is important that the depth and color frames from the camera are aligned:

cpp

rs2::align align_to_color( RS2_STREAM_COLOR );
...
rs2::frameset data = pipe.wait_for_frames();
data = align_to_color.process( data );       // Replace with aligned frames

The depth data is accessed directly, for speed, instead of relying depth_frame.get_distance() which may incur overhead.