Fei Pan
feipan [at] umich.edu
CV | Google Scholar
I am a Research Fellow in EECS at the University of Michigan,
where I am fortunate to work with Prof. Stella X. Yu.
My research lies in Computer Vision and Machine Learning.
I am interested in developing large-scale learning algorithms
for visual tasks with strong generalizability, vigorous robustness, and minimal human supervision.
I obtained my Ph.D. degree in 2023 under the supervision of Prof. In So Kweon at KAIST.
I received an Innovation Fellowship from Qualcomm and
a Ph.D. scholarship from BOSCH during my Ph.D. studies.
MoDA: Leveraging Motion Prior from Videos for Advancing Unsupervised Domain
Adaptation in Semantic Segmentation.
Fei Pan, Xu Yin, Seokju Lee, Axi Niu, Sungeui Yoon, In So Kweon.
IEEE/CVF Computer Vision and Pattern Recognition Conference Workshop (CVPRW), 2024. [pdf] [code]
Learning with Limited Labelled Data for Image and Video Understanding.
Best Paper Award
Unsupervised domain adaptation (UDA) is an effective approach to handle the
lack of annotations in the target domain for the semantic segmentation task.
In this work, we consider a more practical UDA setting where the target domain
contains sequential frames of the unlabeled videos which are easy to collect
in practice. A recent study suggests self-supervised learning of the object motion
from unlabeled videos with geometric constraints. We design a motion-guided domain
adaptive semantic segmentation framework (MoDA) that utilizes self-supervised object
motion to learn effective representations in the target domain. MoDA differs from
previous methods that use temporal consistency regularization for the target domain frames.
Instead, MoDA deals separately with the domain alignment on the foreground and
background categories using different strategies. Specifically, MoDA contains foreground
object discovery and foreground semantic mining to align the foreground domain gaps by
taking the instance-level guidance from the object motion.
Additionally, MoDA includes background adversarial training which contains a background
category-specific discriminator to handle the background domain gaps.
Experimental results on multiple benchmarks highlight the effectiveness of
MoDA against existing approaches in the domain adaptive image segmentation and
domain adaptive video segmentation. Moreover, MoDA is versatile and can be used in
conjunction with existing state-of-the-art approaches to further improve performance.
Key Words:
Unsupervised Domain Adaptation, Semantic Segmentation, Motion Understanding, Geometric Learning.
ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object.
Chenshuang Zhang, Fei Pan, Junmo Kim, In So Kweon, Chengzhi Mao.
IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2024. [pdf] [code]
Highlight Poster
We establish rigorous benchmarks for visual perception robustness.
Synthetic images such as ImageNet-C, ImageNet-9, and Stylized ImageNet provide specific types of
evaluation over synthetic corruptions, backgrounds, and textures,
yet those robustness benchmarks are restricted to specified variations and have low synthetic quality.
In this work, we introduce generative models as a data source for synthesizing hard images that
benchmark deep models' robustness.
Leveraging diffusion models, we are able to generate images with more diversified backgrounds,
textures, and materials than any prior work; we term this benchmark ImageNet-D.
Experimental results show that ImageNet-D causes a significant accuracy drop
across a range of vision models, from the standard ResNet visual classifier to the
latest foundation models like CLIP and MiniGPT-4, reducing their accuracy
by up to 64%. Our work suggests that diffusion models can be an effective source to test vision models.
Key Words:
Diffusion Models, Large-Scale Vision and Language Models, Robustness and Generalization.
Zero-shot Building Attribute Extraction from Large-Scale Vision and Language Models.
Fei Pan, Sangryul Jeon, Brian Wang, Frank Mckenna, Stella Yu.
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024. [pdf] [code] [poster]
Modern building recognition methods, exemplified by the BRAILS framework,
utilize supervised learning to extract information from satellite and
street-view images for image classification and semantic segmentation tasks.
However, each task module requires human-annotated data,
hindering the scalability and robustness to regional variations and annotation imbalances.
In response, we propose a new zero-shot workflow for building attribute extraction
that utilizes large-scale vision and language models to mitigate reliance on external annotations.
The proposed workflow contains two key components: image-level captioning and
segment-level captioning for the building images based on the vocabularies
pertinent to structural and civil engineering.
These two components generate descriptive captions by computing feature
representations of the image and the vocabularies,
and then performing a semantic match between the visual and textual representations.
Consequently, our framework offers a promising avenue to enhance AI-driven
captioning for building attribute extraction in the structural and
civil engineering domains, ultimately reducing reliance on human annotations
while bolstering performance and adaptability.
Key Words:
Zero-shot Learning, Building Attribute Extraction, Large-Scale Vision & Language Models.
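The semantic match between visual and textual representations described in this abstract can be illustrated with a cosine-similarity lookup. This is only a minimal sketch, not the paper's implementation: the embeddings are hand-written toy vectors standing in for the outputs of a vision and language model, and the vocabulary terms and the function name `match_attribute` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def match_attribute(image_feature, vocabulary):
    """Return the vocabulary term whose text feature is most similar to the image feature."""
    return max(vocabulary, key=lambda term: cosine(image_feature, vocabulary[term]))

# Toy embeddings (assumed values); in practice both sides would come from
# the image and text encoders of a vision and language model.
vocabulary = {
    "gable roof": [0.9, 0.1, 0.0],
    "flat roof": [0.1, 0.9, 0.0],
    "hip roof": [0.0, 0.2, 0.9],
}
image_feature = [0.8, 0.3, 0.1]

best = match_attribute(image_feature, vocabulary)
print(best)  # gable roof
```

Because no term in the vocabulary needs labeled training images, extending the attribute set only requires adding new text entries.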
Masking-augmented Collaborative Domain Congregation for
Multi-target Domain Adaptation in Semantic Segmentation.
Fei Pan*, Dong He*, Xu Yin, Chenshuang Zhang, Munchurl Kim.
IEEE Intelligent Vehicles Symposium (IV), 2024.
Best Paper Nominee
This paper addresses the challenges in multi-target domain adaptive segmentation
which aims at learning a single model that adapts to multiple diverse target domains.
Existing methods show limited performance as they only consider the difference in visual appearance (style)
while ignoring the (contextual) variations among multiple target domains.
In contrast, we propose a novel approach termed Masking-augmented Collaborative Domain Congregation (MacDC)
to handle the style gap and contextual gap altogether.
The proposed MacDC comprises two key parts: collaborative domain congregation (CDC) and multi-context masking consistency (MCMC).
Our CDC handles the style and contextual gaps among target domains by data mixing, which generates image-level and region-level
intermediate domains among target domains. To further strengthen contextual alignment,
our MCMC applies a masking-based self-supervised augmentation consistency that enforces the model's understanding of
diverse contexts together.
MacDC directly learns a single model for multi-target domain adaptation without requiring multiple network training and subsequent distillation.
Despite its simplicity, MacDC shows efficacy in mitigating the style and contextual gap among multiple target domains and demonstrates
superior performance on multi-target domain adaptation for segmentation benchmarks compared to existing state-of-the-art approaches.
Key Words: Multi-target Domain Adaptation, Semantic Segmentation, Masking Consistency, Self-supervised Data Augmentation.
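To make the three ingredients named in the MacDC abstract concrete (image-level mixing, region-level mixing, and masking), here is a minimal sketch on tiny grayscale images stored as nested lists. The abstract does not specify the exact operators, so the pixel-wise blend, the rectangular paste, and the random patch masking below are assumptions chosen for illustration, and all function names are hypothetical.

```python
import random

def mix_image_level(img_a, img_b, alpha=0.5):
    """Blend two same-sized images pixel by pixel (image-level intermediate domain)."""
    return [[alpha * a + (1 - alpha) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(img_a, img_b)]

def mix_region_level(img_a, img_b, top, left, height, width):
    """Paste a rectangular region of img_b into a copy of img_a (region-level intermediate domain)."""
    mixed = [row[:] for row in img_a]
    for r in range(top, top + height):
        for c in range(left, left + width):
            mixed[r][c] = img_b[r][c]
    return mixed

def mask_patches(img, patch, ratio, rng):
    """Zero out each patch x patch block independently with probability `ratio`."""
    masked = [row[:] for row in img]
    h, w = len(img), len(img[0])
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            if rng.random() < ratio:
                for r in range(top, min(top + patch, h)):
                    for c in range(left, min(left + patch, w)):
                        masked[r][c] = 0.0
    return masked

# Two toy 4x4 "target domain" images (assumed data).
img_a = [[1.0] * 4 for _ in range(4)]
img_b = [[3.0] * 4 for _ in range(4)]

blended = mix_image_level(img_a, img_b, alpha=0.5)
pasted = mix_region_level(img_a, img_b, top=1, left=1, height=2, width=2)
masked = mask_patches(blended, patch=2, ratio=0.5, rng=random.Random(0))
```

In a consistency setup, the prediction on `masked` would be encouraged to agree with the prediction on the unmasked input; that training loop is omitted here.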
CCTV-Calib: a Toolbox to Calibrate Surveillance Cameras Around the Globe.
Francois Rameau, Jaesung Choe, Fei Pan, Seokju Lee, In So Kweon.
Machine Vision and Applications, 2023. [pdf] [code]
In this paper, we propose CCTV-Calib, a user-friendly toolbox to calibrate
traffic cameras using satellite views.
Specifically, CCTV-Calib can estimate the intrinsic and extrinsic
parameters as well as the GPS location of one or multiple CCTV
cameras in a few clicks. Previous surveillance camera calibration
strategies rely on various assumptions on the camera parameters
(e.g., absence of radial distortion), location, or detected objects
in the scene. In contrast, our system is able to calibrate both
perspective and fisheye cameras without restrictive structural
or semantic assumptions. In fact, only a few correspondences
between an image and its satellite view are sufficient to accurately
calibrate a camera. This kind of camera geo-localization and
calibration via satellite imaging has so far attracted little attention.
As a result, most existing techniques naively rely on manually
clicked keypoint correspondences between the satellite view and
the CCTV image, leading to poor accuracy and repeatability. To
cope with these limitations and to ease the calibration process, we
propose an automated keypoints matching stage and a refinement
process improving the accuracy of the computed parameters. Our
toolbox has been qualitatively and quantitatively evaluated using
synthetic and real data from various traffic cameras around the
globe. We made these unique datasets freely available to the
community. Finally, in order to illustrate the relevance of our
calibration strategy, we demonstrate its applicability to 3D vehicle
geolocalization. Our novel calibration pipeline is integrated in an
easy-to-use GUI and is freely available via the following link:
Key Words:
Camera Calibration, CCTV, Vehicle Geolocalization.
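The CCTV-Calib abstract notes that a few correspondences between an image and its satellite view suffice for calibration. As a simplified illustration of what such correspondences can determine, the sketch below fits a planar homography from exactly four point pairs. This is a textbook technique swapped in for illustration, not the CCTV-Calib pipeline, which estimates full intrinsic and extrinsic parameters and supports fisheye cameras. The function names and the coordinates are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_homography(src, dst):
    """Estimate a 3x3 homography (last entry fixed to 1) from four point pairs."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h1 x + h2 y + h3) / (h7 x + h8 y + 1), rearranged to be linear in h
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        # v = (h4 x + h5 y + h6) / (h7 x + h8 y + 1)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]

def project(h, point):
    """Map a 2D point through the homography."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)

# Hypothetical clicked correspondences: camera image points and satellite points.
image_pts = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
satellite_pts = [(0.0, 0.0), (4.0, 0.0), (3.0, 2.0), (1.0, 2.0)]

H = fit_homography(image_pts, satellite_pts)
center = project(H, (0.5, 0.5))  # where the image center lands on the satellite view
```

With more than four noisy correspondences, a least-squares fit followed by a refinement step would be the usual choice; this sketch only covers the minimal case.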
ML-BPM: Multi-teacher Learning with Bidirectional Photometric Mixing for Open
Compound Domain Adaptation in Semantic Segmentation.
Fei Pan, Sungsu Hur, Seokju Lee, Junsik Kim, In So Kweon.
European Conference on Computer Vision (ECCV), 2022. [pdf]
Open compound domain adaptation (OCDA) considers the target domain as the
compound of multiple unknown homogeneous subdomains.
The goal of OCDA is to minimize the domain gap between the labeled source domain
and the unlabeled compound target domain, which benefits the model generalization
to the unseen domains. Current OCDA methods for semantic segmentation adopt manual
domain separation and employ a single model to simultaneously adapt to all the
target subdomains. However, adapting to a target subdomain might hinder the model
from adapting to other dissimilar target subdomains, which leads to limited performance.
In this work, we introduce a multi-teacher framework with bidirectional photometric
mixing to separately adapt to every target subdomain. First, we present an automatic
domain separation to find the optimal number of subdomains. On this basis, we propose
a multi-teacher framework in which each teacher model uses bidirectional photometric
mixing to adapt to one target subdomain. Furthermore, we conduct an adaptive distillation
to learn a student model and apply consistency regularization to improve the student
generalization. Experimental results on benchmark datasets show the efficacy of the
proposed approach for both the compound domain and the open domains against existing
state-of-the-art approaches.
Key Words:
Domain Adaptation, Open Compound Domain Adaptation, Semantic Segmentation, Multi-teacher Distillation.
Labeling Where Adapting Fails: Cross-Domain Semantic Segmentation with Point
Supervised via Active Learning.
Fei Pan, Francois Rameau, Junsik Kim, In So Kweon.
arXiv, 2022. [pdf]
Training models dedicated to semantic segmentation requires a large amount
of pixel-wise annotated data. Due to their costly nature, these annotations
might not be available for the task at hand. To alleviate this problem,
unsupervised domain adaptation approaches aim at aligning the feature
distributions between the labeled source and the unlabeled target data.
While these strategies lead to noticeable improvements, their effectiveness
remains limited. To guide the domain adaptation task more efficiently, previous
works attempted to include human interactions in this process under the form of
sparse single-pixel annotations in the target data. In this work, we propose a
new domain adaptation framework for semantic segmentation with annotated points
via active selection. First, we conduct an unsupervised domain adaptation of the
model; from this adaptation, we use an entropy-based uncertainty measurement for
target points selection. Finally, to minimize the domain gap, we propose a domain
adaptation framework utilizing these target points annotated by human annotators.
Experimental results on benchmark datasets show the effectiveness of our methods
against existing unsupervised domain adaptation approaches. The proposed pipeline
is generic and can be included as an extra module to existing domain adaptation strategies.
Key Words:
Active Learning, Unsupervised Domain Adaptation, Semantic Segmentation.
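The entropy-based selection of target points mentioned in this abstract can be sketched as follows: compute the entropy of the predicted class distribution at every pixel and send the most uncertain pixels to a human annotator. This is a minimal illustration under assumed inputs, not the paper's code; the function names, the annotation budget, and the toy probabilities are hypothetical.

```python
import math

def pixel_entropy(probs):
    """Shannon entropy of one pixel's class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_points(prob_grid, budget):
    """Return (row, col) coordinates of the `budget` most uncertain pixels."""
    scored = [
        (pixel_entropy(probs), (r, c))
        for r, row in enumerate(prob_grid)
        for c, probs in enumerate(row)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [coord for _, coord in scored[:budget]]

# Toy 2x2 prediction with two classes per pixel (assumed values).
prob_grid = [
    [[0.99, 0.01], [0.60, 0.40]],
    [[0.50, 0.50], [0.90, 0.10]],
]

points = select_points(prob_grid, budget=2)
print(points)  # [(1, 0), (0, 1)]
```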
Attentive and Contrastive Learning for Joint Depth and Motion Field Estimation.
Seokju Lee, Francois Rameau, Fei Pan, In So Kweon.
International Conference on Computer Vision (ICCV), 2021. [pdf] [code]
Estimating the motion of the camera together with the 3D
structure of the scene from a monocular vision system is a
complex task that often relies on the so-called scene rigidity
assumption. When observing a dynamic environment, this
assumption is violated which leads to an ambiguity between
the ego-motion of the camera and the motion of the objects.
To solve this problem, we present a self-supervised learning
framework for 3D object motion field estimation from
monocular videos. Our contributions are two-fold. First, we
propose a two-stage projection pipeline to explicitly disentangle
the camera ego-motion and the object motions with a
dynamics attention module, called DAM. Specifically, we
design an integrated motion model that estimates the motion
of the camera and object in the first and second warping stages,
respectively, controlled by the attention module
through a shared motion encoder. Second, we propose an
object motion field estimation through contrastive sample
consensus, called CSAC, taking advantage of weak semantic
prior (bounding box from an object detector) and geometric
constraints (each object respects the rigid body motion
model). Experiments on KITTI, Cityscapes, and Waymo
Open Dataset demonstrate the relevance of our approach
and show that our method outperforms state-of-the-art algorithms
for the tasks of self-supervised monocular depth
estimation, object motion segmentation, monocular scene
flow estimation, and visual odometry.
Key Words:
Motion Field Estimation, Monocular Depth Prediction, Geometric Learning.
Two-phase Pseudo Label Densification for Self-training based Domain Adaptation.
Inkyu Shin, Sanghyun Woo, Fei Pan, In So Kweon.
European Conference on Computer Vision (ECCV), 2020. [pdf]
Recently, deep self-training approaches emerged as a powerful solution to
the unsupervised domain adaptation. The self-training scheme involves
iterative processing of target data; it generates target pseudo labels
and retrains the network. However, since only the confident predictions
are taken as pseudo labels, existing self-training approaches inevitably
produce sparse pseudo labels in practice. We see this as critical because
the resulting insufficient training signals lead to a suboptimal,
error-prone model. In order to tackle this problem, we propose a novel
Two-phase Pseudo Label Densification framework, referred to as TPLD.
In the first phase, we use sliding window voting to propagate the confident
predictions, utilizing intrinsic spatial-correlations in the images.
In the second phase, we perform a confidence-based easy-hard classification.
For the easy samples, we now employ their full pseudo labels.
For the hard ones, we instead adopt adversarial learning to enforce hard-to-easy
feature alignment. To ease the training process and avoid noisy predictions,
we introduce the bootstrapping mechanism to the original self-training loss.
We show the proposed TPLD can be easily integrated into existing self-training
based approaches and improves the performance significantly.
Combined with the recently proposed CRST self-training framework, we achieve
new state-of-the-art results on two standard UDA benchmarks.
Key Words:
Self-training, Domain Adaptation, Pseudo Label Correction.
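The sparsity problem and the first densification phase described in the TPLD abstract can be illustrated with a toy example. The sketch below keeps only confident predictions as pseudo labels and then fills the remaining pixels by a hard majority vote inside a sliding window. This is a simplification made for illustration and may differ from the voting rule used in the paper; the threshold, the window radius, and the function names are assumptions.

```python
from collections import Counter

IGNORE = -1  # marker for pixels without a pseudo label

def confident_labels(prob_grid, threshold=0.9):
    """Keep the argmax class only where its probability reaches the threshold."""
    labels = []
    for row in prob_grid:
        out = []
        for probs in row:
            best = max(range(len(probs)), key=lambda k: probs[k])
            out.append(best if probs[best] >= threshold else IGNORE)
        labels.append(out)
    return labels

def densify_by_voting(labels, radius=1):
    """Fill unlabeled pixels with the majority label among confident neighbors."""
    h, w = len(labels), len(labels[0])
    dense = [row[:] for row in labels]
    for r in range(h):
        for c in range(w):
            if labels[r][c] != IGNORE:
                continue
            votes = Counter(
                labels[rr][cc]
                for rr in range(max(0, r - radius), min(h, r + radius + 1))
                for cc in range(max(0, c - radius), min(w, c + radius + 1))
                if labels[rr][cc] != IGNORE
            )
            if votes:
                dense[r][c] = votes.most_common(1)[0][0]
    return dense

# Toy 3x3 prediction with two classes per pixel (assumed values).
prob_grid = [
    [[0.95, 0.05], [0.60, 0.40], [0.97, 0.03]],
    [[0.92, 0.08], [0.55, 0.45], [0.95, 0.05]],
    [[0.30, 0.70], [0.40, 0.60], [0.02, 0.98]],
]

sparse = confident_labels(prob_grid)  # [[0, -1, 0], [0, -1, 0], [-1, -1, 1]]
dense = densify_by_voting(sparse)     # [[0, 0, 0], [0, 0, 0], [0, 0, 1]]
```

Votes are counted on the original sparse labels only, so a filled pixel never influences its neighbors within the same pass.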
Unsupervised Intra-domain Adaptation for Semantic Segmentation through Self-supervision.
Fei Pan, Inkyu Shin, Francois Rameau, Seokju Lee, In So Kweon.
IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2020. [pdf] [code]
Oral Presentation
Convolutional neural network-based approaches have achieved remarkable progress
in semantic segmentation. However, these approaches heavily rely on annotated
data which are labor intensive. To cope with this limitation, automatically
annotated data generated from graphic engines are used to train segmentation
models. However, the models trained from synthetic data are difficult to transfer
to real images. To tackle this issue, previous works have considered directly
adapting models from the source data to the unlabeled target data (to reduce the
inter-domain gap). Nonetheless, these techniques do not consider the large
distribution gap among the target data itself (intra-domain gap).
In this work, we propose a two-step self-supervised domain adaptation approach
to minimize the inter-domain and intra-domain gap together.
First, we conduct the inter-domain adaptation of the model;
from this adaptation, we separate the target domain into an easy and hard split
using an entropy-based ranking function. Finally, to decrease the intra-domain
gap, we propose to employ a self-supervised adaptation technique from the easy to
the hard split. Experimental results on numerous benchmark datasets highlight the
effectiveness of our method against existing state-of-the-art approaches.
The source code is available at https://github.com/feipanir/IntraDA.
Key Words:
Domain Adaptation, Adversarial Training, Semantic Segmentation, Self-supervised Learning.
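The entropy-based ranking used to separate the target domain into an easy and a hard split can be sketched in a few lines. The version below scores each target image by the mean entropy of its per-pixel predictions and assigns the lowest-entropy fraction to the easy split. It is a minimal illustration with assumed toy predictions, not the released implementation; the function names and the `easy_ratio` parameter are hypothetical.

```python
import math

def mean_entropy(prob_map):
    """Average per-pixel entropy; prob_map is a list of per-pixel class distributions."""
    total = 0.0
    for probs in prob_map:
        total += -sum(p * math.log(p) for p in probs if p > 0)
    return total / len(prob_map)

def split_easy_hard(predictions, easy_ratio=0.5):
    """Rank target images by mean entropy and split them into easy and hard subsets."""
    ranked = sorted(predictions, key=lambda name: mean_entropy(predictions[name]))
    n_easy = int(len(ranked) * easy_ratio)
    return ranked[:n_easy], ranked[n_easy:]

# Toy predictions for four target images, two pixels each (assumed values).
predictions = {
    "img_a": [[0.98, 0.02], [0.95, 0.05]],  # confident prediction
    "img_b": [[0.55, 0.45], [0.50, 0.50]],  # uncertain prediction
    "img_c": [[0.90, 0.10], [0.80, 0.20]],
    "img_d": [[0.60, 0.40], [0.70, 0.30]],
}

easy, hard = split_easy_hard(predictions, easy_ratio=0.5)
print(easy)  # ['img_a', 'img_c']
print(hard)  # ['img_d', 'img_b']
```

The easy split would then supply pseudo labels for the second adaptation step toward the hard split.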
Variational Prototyping-Encoder: One-shot Learning with Prototypical Images.
Junsik Kim, Tae-hyun Oh, Seokju Lee, Fei Pan, In So Kweon.
IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2019. [pdf] [code]
In daily life, graphic symbols, such as traffic signs and brand logos,
are ubiquitously utilized around us due to their intuitive expression beyond
language boundaries. We tackle an open-set graphic symbol recognition problem
by one-shot classification with prototypical images as a single training example
for each novel class. We take an approach to learn a generalizable embedding space
for novel tasks. We propose a new approach called variational prototyping-encoder (VPE)
that learns the image translation task from real-world input images to their corresponding
prototypical images as a meta-task. As a result, VPE learns image similarity as well as
prototypical concepts, which differs from widely used metric learning based approaches.
Our experiments with diverse datasets demonstrate that the proposed VPE performs favorably
against competing metric learning based one-shot methods. Also, our qualitative analyses
show that our meta-task induces an effective embedding space suitable for unseen data.
Key Words:
One-Shot Learning, Prototypical Learning, Variational Auto-encoder.
Driver Drowsiness Detection System Based on Feature Representation Learning Using Various Deep Networks.
Sanghyuk Park, Fei Pan, Sunghun Kang, Chang D. Yoo.
Asian Conference on Computer Vision Workshops (ACCVW), 2016. [pdf]
Statistics have shown that 20% of all road accidents are fatigue-related,
and drowsiness detection is a car safety algorithm that can alert a snoozing driver
in hopes of preventing an accident.
This paper proposes a deep architecture referred to as deep drowsiness detection (DDD)
network for learning effective features and detecting drowsiness given an RGB input
video of a driver. The DDD network consists of three deep networks for attaining global
robustness to background and environmental variations and learning local facial
movements and head gestures important for reliable detection.
The outputs of the three networks are integrated and fed to a softmax classifier for
drowsiness detection. Experimental results show that DDD achieves 73.06% detection accuracy on
the NTHU-drowsy driver detection benchmark dataset.
Key Words:
Driver Drowsiness Detection, Representation Learning.
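The final step of the DDD network, where the outputs of three networks are integrated and fed to a softmax classifier, can be sketched as below. The abstract does not say how the outputs are integrated, so concatenation followed by a single linear layer is an assumption here, and the feature values, weights, and function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features_per_network, weights, bias):
    """Concatenate per-network features, apply a linear layer, then softmax."""
    fused = [x for feats in features_per_network for x in feats]
    scores = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return softmax(scores)

# Toy outputs of three networks, two features each (assumed values).
features = [[0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
# Hand-set weights for two classes: index 0 = alert, index 1 = drowsy (assumed).
weights = [
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1],
]
bias = [0.0, 0.0]

probs = classify(features, weights, bias)
```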
The three fundamental problems of computer vision are correspondence, correspondence, and correspondence! -- Takeo Kanade