This involves background modelling (being able to "ignore" the background), foreground modelling (being able to extract features of the foreground objects, such as how "human-like" they are), and tracking (being able to relate an object in one scene to the same object in another scene, where it appears in a different position and pose).
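To make the three stages concrete, here is a minimal sketch in plain Python. It is only an illustration under simplifying assumptions, not a method taken from the reference below: frames are tiny grayscale grids, the background model is a running average, the only foreground "feature" is the centroid of the changed pixels, and tracking is a nearest-position check between consecutive frames. All function names, the toy frames, and the thresholds are hypothetical.

```python
from math import hypot

def update_background(background, frame, alpha=0.05):
    """Background modelling: blend the new frame into a running average."""
    return [[(1 - alpha) * b + alpha * f for b, f in zip(b_row, f_row)]
            for b_row, f_row in zip(background, frame)]

def foreground_mask(background, frame, threshold=50):
    """Mark pixels that differ strongly from the background estimate."""
    return [[abs(f - b) > threshold for b, f in zip(b_row, f_row)]
            for b_row, f_row in zip(background, frame)]

def centroid(mask):
    """Foreground modelling (very crude): one feature, the blob centroid."""
    pts = [(r, c) for r, row in enumerate(mask) for c, on in enumerate(row) if on]
    if not pts:
        return None
    return (sum(r for r, _ in pts) / len(pts), sum(c for _, c in pts) / len(pts))

def same_object(prev, curr, max_dist=2.0):
    """Tracking: assume it is the same object if it moved at most max_dist pixels."""
    if prev is None or curr is None:
        return False
    return hypot(curr[0] - prev[0], curr[1] - prev[1]) <= max_dist

def make_frame(obj_col, size=5):
    """Toy frame: dim background (10) with one bright pixel (200) in row 2."""
    frame = [[10] * size for _ in range(size)]
    frame[2][obj_col] = 200
    return frame

background = [[10.0] * 5 for _ in range(5)]
track = []  # positions of the tracked object, one per frame
for col in (0, 1, 2, 3):
    frame = make_frame(col)
    pos = centroid(foreground_mask(background, frame))
    if not track or same_object(track[-1], pos):
        track.append(pos)
    background = update_background(background, frame)

print(track)
```

A real system would replace each piece with something much stronger (for example a per-pixel statistical background model, shape or appearance features for deciding what is "human-like", and a motion model for tracking), but the division of labour between the three stages stays the same.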

See Hanzi's Homepage