Decomposing the Observable Self
How much of a person can be understood from what is visible? This question has occupied me as I work on systems that extract signals from images and video. The technology is advancing rapidly, but I find myself more interested in its limits than its capabilities.
The Inner-Outer Distinction
William James (1890) distinguished between the “I” (the knowing self) and the “Me” (the known self—the self as object). The “Me” includes the material self (body, possessions), the social self (recognition from others), and the spiritual self (inner thoughts, dispositions). Only portions of the “Me” are externally observable.
Goffman (1959) extended this with dramaturgy: we perform versions of ourselves for different audiences. The “front stage” self—what others see—is curated, contextual, and strategic. The “back stage” self remains hidden.
What this means to me: Any system that observes humans captures only performances, not essences. When I look at a photograph of someone, I’m seeing a moment they chose to present (or had presented for them). This isn’t a limitation to overcome—it’s a fundamental truth about what observation can access. I think we forget this too easily when building technology.
What Can Be Extracted from Images and Video
Face Detection: The Foundation
Modern face detection has achieved near-human performance under good conditions:
| System/Study | Dataset | Accuracy | Notes |
|---|---|---|---|
| RetinaFace (Deng et al., 2020) | WIDER FACE (hard) | 91.4% AP | State-of-the-art |
| MTCNN (Zhang et al., 2016) | FDDB | 95.1% recall | Widely deployed |
| Human performance | FDDB | ~94% | Baseline comparison |
However, accuracy degrades significantly with:
- Occlusion: 30-40% drop with partial face coverage (Ge et al., 2017)
- Pose variation: 15-25% drop at extreme angles >60° (Zhu & Ramanan, 2012)
- Low resolution: Below 20×20 pixels, detection drops to <50% (Yang et al., 2016)
- Lighting: Low-light conditions reduce accuracy by 20-35% (Li et al., 2019)
What this means: The 91-95% accuracy numbers are misleading for real-world use. They come from benchmark conditions that don’t reflect how people actually appear in photos and videos—partially obscured, badly lit, at odd angles. I’ve learned to mentally discount benchmark performance by 20-30% when thinking about practical applications.
Facial Landmark Detection
68-point landmark models achieve:
| Model | Dataset | NME (Normalized Mean Error) |
|---|---|---|
| HRNet (Sun et al., 2019) | 300W | 2.87% |
| AWing (Wang et al., 2019) | WFLW | 4.36% |
| Human inter-rater | 300W | ~3.0% |
What this means: Landmark detection is genuinely good—approaching human-level agreement. This is the kind of geometric signal I trust. It’s measuring shape, not meaning.
Body Pose Estimation
Skeleton estimation accuracy (PCK @ 0.5 threshold—correct if within 50% of head size):
| Model | Dataset | PCK@0.5 |
|---|---|---|
| HRNet (Sun et al., 2019) | COCO | 76.3% |
| OpenPose (Cao et al., 2019) | COCO | 65.3% |
| AlphaPose (Fang et al., 2017) | COCO | 72.3% |
Multi-person scenarios show 10-15% degradation due to occlusion and association errors.
What this means: Pose estimation is useful but imperfect. 76% means roughly 1 in 4 joint positions is noticeably wrong. For measuring gross body position—standing, sitting, facing toward/away—it’s adequate. For fine-grained gesture analysis, I’d be cautious.
The Problem with Emotion Recognition
This is where I become skeptical of the entire enterprise.
Claimed accuracy in controlled conditions:
| System | Dataset | Reported Accuracy |
|---|---|---|
| Commercial API A | FER2013 | 71% |
| Commercial API B | AffectNet | 65% |
| Academic SOTA | RAF-DB | 88% |
But these numbers are misleading. Barrett et al. (2019) conducted a meta-analysis of 1,000+ studies and found:
- Within-culture agreement on facial expressions: r = 0.39 to 0.58
- Cross-cultural agreement: drops to r = 0.20 to 0.45
- Reliability of “basic emotion” categories: challenged by 25%+ disagreement even in posed expressions
Key findings from their meta-analysis:
| Expression | % Agreement (Western) | % Agreement (Cross-cultural) |
|---|---|---|
| Happiness | 90% | 69% |
| Sadness | 75% | 56% |
| Anger | 74% | 59% |
| Fear | 65% | 48% |
| Disgust | 65% | 45% |
| Surprise | 83% | 67% |
What this means to me: This data changed how I think about the field. If humans only agree 48% of the time on what “fear” looks like across cultures, then any algorithm trained on Western-labeled data is encoding one culture’s interpretation as universal truth. The 71% accuracy of a commercial API isn’t measuring “emotion detection”—it’s measuring agreement with Western labelers’ interpretations of posed expressions.
I find this troubling because these systems are being deployed as if they measure something real. But they’re measuring agreement with a training set, not ground truth about human experience.
Facial Action Units: A More Honest Alternative
The Facial Action Coding System (FACS; Ekman & Friesen, 1978) describes muscle movements, not emotions:
| AU | Description | Detection Accuracy (F1) |
|---|---|---|
| AU1 | Inner brow raiser | 0.51 |
| AU2 | Outer brow raiser | 0.45 |
| AU4 | Brow lowerer | 0.55 |
| AU6 | Cheek raiser | 0.76 |
| AU12 | Lip corner puller | 0.86 |
| AU15 | Lip corner depressor | 0.38 |
| AU25 | Lips part | 0.92 |
Data from BP4D-Spontaneous dataset (Zhang et al., 2014).
What this means: AUs are more honest because they describe what’s actually observable—muscle movements—without claiming to know what they mean. AU12 (lip corner puller) is reliably detectable at 0.86 F1. Whether that “smile” indicates joy, politeness, nervousness, or a photographer’s instruction is a separate question entirely.
This is the approach I want to take with taocore-human: describe the signal, not the interpretation.
Demographic Attributes: Where Accuracy Meets Ethics
Gender Classification Bias
Buolamwini & Gebru (2018) tested three commercial systems on the Pilot Parliaments Benchmark:
| Demographic Group | System A | System B | System C |
|---|---|---|---|
| Lighter-skinned males | 0.8% error | 0.0% error | 0.0% error |
| Lighter-skinned females | 6.0% error | 1.7% error | 7.1% error |
| Darker-skinned males | 12.0% error | 0.7% error | 6.0% error |
| Darker-skinned females | 34.7% error | 21.3% error | 34.5% error |
What this means: Error rates for darker-skinned females were up to 43× higher than for lighter-skinned males. This isn’t a bug to be fixed with more data—it reflects whose faces were considered worth including in training sets. The bias is structural.
My position: I don’t think we should build systems that classify gender from appearance. It conflates sex, gender identity, and gender expression. It fails for people who don’t fit binary categories. And even when it “works,” it reinforces the idea that gender is something readable from the surface.
Age Estimation
State-of-the-art mean absolute error (MAE) in years:
| Model | MORPH II | FG-NET | UTKFace |
|---|---|---|---|
| DEX (Rothe et al., 2018) | 2.68 | 4.63 | - |
| SSR-Net (Yang et al., 2018) | 3.16 | 4.48 | 5.21 |
But MAE varies significantly by actual age:
| Age Group | Typical MAE |
|---|---|
| 0-20 | 3-5 years |
| 20-40 | 4-6 years |
| 40-60 | 5-8 years |
| 60+ | 8-12 years |
And by demographics (Ricanek & Tesafaye, 2006):
- African American faces: +1.5 years MAE vs. Caucasian
- Female faces: +0.8 years MAE vs. male (on Western-trained models)
What this means: Age estimation is less politically charged than gender, but still unreliable. ±8 years for someone over 60 means the system might guess 55 or 70 for someone who is 63. If taocore-human ever reports age, it should be as a wide range with explicit uncertainty—“apparent age: 55-70”—not a point estimate.
Race Classification: Why I Won’t Build It
Accuracy seems high in benchmarks:
| Study | Dataset | Accuracy |
|---|---|---|
| Fu et al. (2014) | MORPH II (3 classes) | 99.1% |
| Guo & Mu (2014) | PCSO (2 classes) | 98.3% |
But these benchmarks are fundamentally flawed:
- Limited categories: Most use 2-4 racial categories, erasing mixed-race and many ethnic groups
- Labeling inconsistency: Inter-rater agreement on race labels is only 85-90% (Albiero et al., 2020)
- Downstream harm: These systems have led to wrongful arrests (Hill, 2020)
My position: Race is a social construct, not a biological category that can be read from faces. The high “accuracy” of these systems means they’re good at reproducing the racial categories their training data encoded—not that race is something objectively measurable.
I will not build race classification into taocore-human. Some capabilities shouldn’t exist.
Reliability Across Conditions
Real-world performance differs dramatically from benchmark performance:
| Condition | Face Detection | Landmark | Pose | Expression |
|---|---|---|---|---|
| Lab conditions | 95%+ | 97%+ | 85%+ | 70%+ |
| Good natural light | 90%+ | 92%+ | 75%+ | 55%+ |
| Indoor variable | 80%+ | 85%+ | 65%+ | 45%+ |
| Low light | 60%+ | 70%+ | 50%+ | 30%+ |
| Motion blur | 55%+ | 60%+ | 45%+ | 25%+ |
| Occlusion (>30%) | 50%+ | 55%+ | 40%+ | 20%+ |
What this means: The gap between benchmark and reality is enormous. A system that’s 95% accurate in the lab might be 50% accurate on your actual photos. This is why taocore-human needs to report confidence and refuse to interpret when conditions are poor. Silence is more honest than false confidence.
The Signal-to-Meaning Gap
I think of this as a hierarchy where reliability degrades at each level:
Level 1: Pixels (raw data)
↓ [~5% information loss - compression, noise]
Level 2: Geometric features (faces, poses, positions)
↓ [10-20% error - detection/estimation]
Level 3: Behavioral signals (AUs, pose configurations)
↓ [30-50% error - context-dependent mapping]
Level 4: Psychological inference (emotions, intentions)
↓ [50-70% error - weak construct validity]
Level 5: Identity claims (who someone "is")
↓ [undefined - philosophical category error]
My principle for taocore-human: Operate at Levels 2-3. Report geometric features and behavioral signals with confidence intervals. Do not attempt Levels 4-5.
The temptation is always to climb higher—to say something more meaningful. But meaning without validity is noise dressed up as signal.
What Else Helps Us Understand Ourselves?
Voice and Speech
Paralinguistic features show moderate reliability:
| Feature | Task | Accuracy/Correlation |
|---|---|---|
| Pitch variation | Arousal detection | r = 0.45-0.65 |
| Speech rate | Stress detection | r = 0.35-0.50 |
| Voice quality | Depression screening | AUC = 0.70-0.80 |
Data from Schuller & Batliner (2013).
What this means: Voice carries real signal—more than I initially expected. The correlations with arousal and stress are moderate but meaningful. This might be a future direction for taocore-human, though audio raises its own privacy concerns.
Physiological Signals
| Signal | What It Measures | Reliability (r with self-report) |
|---|---|---|
| Heart rate variability | Arousal/stress | 0.40-0.60 |
| Skin conductance | Emotional activation | 0.50-0.70 |
| Pupil dilation | Cognitive load, arousal | 0.35-0.55 |
Data from Kreibig (2010) meta-analysis.
What this means: Physiological signals correlate with self-reported states better than facial expressions do. This makes sense—they’re measuring the body’s actual response, not a culturally mediated display. But they require sensors, which limits practical application.
Framework for Ethical Signal Extraction
Based on the data above, here’s how I think about what taocore-human should and shouldn’t do:
| Signal Type | Reliability | Should Extract? | How to Report |
|---|---|---|---|
| Face presence/location | High (>90%) | Yes | Bounding box + confidence |
| Facial landmarks | Moderate-High (85-95%) | Yes | Points + NME estimate |
| Body pose | Moderate (65-80%) | Yes | Skeleton + PCK confidence |
| Facial AUs | Variable (38-92%) | Selective | Only high-reliability AUs |
| Emotion labels | Low (<60%) | No | - |
| Gender | Biased (65-99%) | No | - |
| Race | Invalid construct | No | - |
| Age | Moderate (±5-8 years) | With uncertainty | Range, not point estimate |
Conclusion: What I’ve Learned
Working through this literature has clarified my thinking:
-
Geometric signals are trustworthy. Where faces are, how bodies are positioned—these are measurable with reasonable accuracy.
-
Behavioral signals require humility. AUs and pose configurations can be detected, but what they mean depends on context we often don’t have.
-
Psychological inferences are overreach. The emotion recognition industry is built on shaky foundations. I don’t want to contribute to it.
-
Identity categorizations are harmful. Race and gender classification encode social constructs as biological facts. They should not be built.
The goal of taocore-human is not to understand people. It’s to describe observable patterns while being honest about what remains unknown. The data tells us clearly: most of what matters about a person is not visible on the surface.
References
Albiero, V., et al. (2020). Analysis of gender inequality in face recognition accuracy. IEEE/CVF WACV Workshops, 81-89.
Barrett, L. F., et al. (2019). Emotional expressions reconsidered. Psychological Science in the Public Interest, 20(1), 1-68.
Buolamwini, J., & Gebru, T. (2018). Gender shades. FAT, 77-91.
Cao, Z., et al. (2019). OpenPose. IEEE TPAMI, 43(1), 172-186.
Deng, J., et al. (2020). RetinaFace. IEEE/CVF CVPR, 5203-5212.
Ekman, P., & Friesen, W. V. (1978). Facial Action Coding System. Consulting Psychologists Press.
Goffman, E. (1959). The Presentation of Self in Everyday Life. Anchor Books.
Hill, K. (2020). Wrongfully accused by an algorithm. The New York Times, June 24.
James, W. (1890). The Principles of Psychology. Henry Holt.
Kreibig, S. D. (2010). Autonomic nervous system activity in emotion. Biological Psychology, 84(3), 394-421.
Ricanek, K., & Tesafaye, T. (2006). MORPH. FG, 341-345.
Rothe, R., et al. (2018). Deep expectation of real and apparent age. IJCV, 126(2-4), 144-157.
Schuller, B., & Batliner, A. (2013). Computational Paralinguistics. Wiley.
Sun, K., et al. (2019). Deep high-resolution representation learning. IEEE/CVF CVPR, 5693-5703.
Zhang, K., et al. (2016). MTCNN. IEEE SPL, 23(10), 1499-1503.
Zhang, X., et al. (2014). BP4D-Spontaneous. Image and Vision Computing, 32(10), 692-706.