TradePoint.io
Meta AI Releases Sapiens2: A High-Resolution Human-Centric Vision Model for Pose, Segmentation, Normals, Pointmap, and Albedo

April 27, 2026
in AI & Technology

If you’ve ever watched a motion capture system struggle with a person’s fingers, or seen a segmentation model fail to distinguish teeth from gums, you already understand why human-centric computer vision is hard. Humans are not just objects: they come with articulated structure, fine surface detail, and enormous variation in pose, clothing, lighting, and ethnicity. Getting a model to understand all of that at once, across arbitrary real-world images, is genuinely difficult.

The Meta AI research team has introduced Sapiens2, the second generation of its foundation-model family for human-centric vision. Trained on a newly curated dataset of 1 billion human images, offered in model sizes from 0.4B to 5B parameters, and designed to operate at native 1K resolution (with hierarchical variants supporting 4K), Sapiens2 is a substantial leap over its predecessor across every benchmark the team evaluated.

Paper: https://arxiv.org/pdf/2604.21681

What Sapiens2 is Trying to Solve

The original Sapiens model relied primarily on Masked Autoencoder (MAE) pretraining. MAE works by masking a large portion of input image patches, 75% in this case, and training the model to reconstruct the missing pixels. This forces the model to learn spatial details and textures, which is useful for dense prediction tasks like segmentation or depth estimation.
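The random-masking step at the heart of MAE can be sketched in a few lines of numpy. The 75% mask ratio matches the figure above; the 196-patch grid and the RNG seed are illustrative choices, not taken from the paper.

```python
import numpy as np

def mae_mask(num_patches: int, mask_ratio: float = 0.75, seed: int = 0):
    """Randomly split patch indices into a visible set (fed to the
    encoder) and a masked set (whose pixels the model must reconstruct),
    as in MAE-style pretraining."""
    rng = np.random.default_rng(seed)
    num_masked = int(num_patches * mask_ratio)
    perm = rng.permutation(num_patches)
    return perm[num_masked:], perm[:num_masked]  # visible, masked

# A 14x14 patch grid (e.g. a 224px image with 16px patches): 196 patches.
visible, masked = mae_mask(196, mask_ratio=0.75)
print(len(visible), len(masked))  # 49 147
```

The encoder runs only on the visible quarter of the patches; the decoder is asked to reconstruct the other three quarters, which is what forces the representation to capture local texture and geometry.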

The problem is that MAE, as a form of masked image modeling (MIM), learns largely through compression. It doesn’t naturally learn high-level semantics. It can tell you what something looks like, but not necessarily what it means in the context of a human body. That’s where contrastive learning (CL) methods like DINO and SimCLR shine: they organize representations semantically by training the model to treat different views of the same image as similar and views of different images as distinct.

But CL has its own tradeoff. Its aggressive augmentations, such as color jitter and blurring, can strip away appearance cues like skin tone or lighting conditions that are critical for tasks such as albedo estimation (recovering the true color of a surface independent of lighting). This is what the research team calls representation drift.

Sapiens2 addresses this problem directly by combining both objectives: a masked image reconstruction loss (L_MAE) to preserve low-level fidelity, and a global contrastive loss (L_CL) on the [CLS] token using a student-teacher framework based on DINOv3, where the teacher’s parameters are an exponential moving average (EMA) of the student’s. Crucially, color augmentations are not applied to the global views used for the MAE objective, preserving the appearance cues needed for photorealistic tasks. The joint objective is L = L_MAE + λ·L_CL.
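The two moving parts of this setup, the EMA teacher update and the weighted sum of losses, are simple enough to write out directly. This is a minimal sketch; the momentum value 0.996 is a typical DINO-style default and not a figure from the paper.

```python
import numpy as np

def ema_update(teacher: np.ndarray, student: np.ndarray,
               momentum: float = 0.996) -> np.ndarray:
    """DINO-style teacher update: the teacher's weights are an
    exponential moving average of the student's, giving slowly-moving,
    stable targets for the contrastive loss."""
    return momentum * teacher + (1.0 - momentum) * student

def joint_loss(l_mae: float, l_cl: float, lam: float = 1.0) -> float:
    """The combined objective: L = L_MAE + lambda * L_CL."""
    return l_mae + lam * l_cl

teacher, student = np.zeros(3), np.ones(3)
teacher = ema_update(teacher, student)
print(teacher)               # each entry takes a small step toward the student
print(joint_loss(0.8, 0.2))  # 1.0
```

Because the teacher moves so slowly, its [CLS] outputs change smoothly from step to step, which is what keeps the contrastive targets from collapsing.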


The Data: Humans-1B

Getting 1 billion training images right required a multi-stage filtering pipeline. Starting from a web-scale pool of approximately 4 billion images, the Meta team applied bounding-box detection, head-pose estimation, aesthetic and realism scoring, CLIP-based feature filtering, and text-overlay detection. The result is a curated corpus in which every image contains at least one prominent person with a minimum short-side resolution of 384 pixels.

To ensure diversity, the research team used perceptual hashing and deep-feature nearest-neighbor pruning for deduplication, then clustered visual embeddings and applied selective sampling to balance the dataset across poses, viewpoints, occlusion levels, clothing types, and lighting conditions. No task labels or human-specific priors were injected during pretraining — just images.
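The perceptual-hashing step can be illustrated with a toy average hash. The paper does not say which perceptual hash Sapiens2 uses, so everything below is a generic stand-in for the idea: near-duplicate images map to bit strings a small Hamming distance apart, so duplicates can be pruned by comparing hashes instead of pixels.

```python
import numpy as np

def average_hash(img: np.ndarray, hash_size: int = 8) -> int:
    """Toy perceptual hash: block-average a grayscale image down to an
    8x8 grid, then emit one bit per cell (above/below the grid mean)."""
    h, w = img.shape
    bh, bw = h // hash_size, w // hash_size
    small = img[:bh * hash_size, :bw * hash_size]
    small = small.reshape(hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    bits = (small > small.mean()).ravel()
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

img = np.random.default_rng(0).random((64, 64))
brighter = img + 0.1  # the same image under a uniform brightness shift
print(hamming(average_hash(img), average_hash(brighter)))  # 0
```

Thresholding against the grid mean makes the hash invariant to uniform brightness changes, which is exactly the kind of nuisance variation a dedup pass should ignore.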

The Architecture: Scaling to 5B and 4K

Sapiens2 introduces four model sizes: 0.4B, 0.8B, 1B, and 5B parameters, each at native 1K resolution. The 5B model is the highest-FLOPs vision transformer reported to date at 15.722 TFLOPs.

For 4K resolution, the research team adopted a hierarchical windowed attention design. The first K layers apply windowed self-attention locally to capture fine texture and boundaries within spatial windows. A [CLS]-guided pooling step then downsamples the 2D token grid by a spatial stride √ω, and the subsequent L layers apply global self-attention over this reduced sequence. This layout is compatible with MAE-style pretraining because masked tokens can be dropped after the local stage, preventing information from leaking across masked regions — a problem that convolutional backbones typically need masked convolutions to avoid.
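The token-count reduction this buys can be sketched with plain average pooling standing in for the paper’s [CLS]-guided pooling (an assumption made here purely for illustration). For a 1024×768 input with 16px patches, the local stage produces a 64×48 token grid; pooling with stride 2 (√ω with ω = 4) leaves the global layers a quarter of the sequence.

```python
import numpy as np

def pool_token_grid(tokens: np.ndarray, stride: int) -> np.ndarray:
    """Downsample an (H, W, C) token grid by average pooling with a
    spatial stride, shrinking the sequence that the later
    global-attention layers must process. (Plain average pooling stands
    in here for the paper's [CLS]-guided pooling.)"""
    H, W, C = tokens.shape
    t = tokens.reshape(H // stride, stride, W // stride, stride, C)
    return t.mean(axis=(1, 3))

# 1024x768 input, 16px patches -> a 64x48 token grid (3072 tokens).
tokens = np.random.default_rng(0).random((64, 48, 8))
pooled = pool_token_grid(tokens, stride=2)
print(pooled.shape)  # (32, 24, 8): global attention runs over 768 tokens
```

Since self-attention cost is quadratic in sequence length, shrinking 3072 tokens to 768 cuts the global-attention cost by roughly 16×, which is what makes 4K inputs tractable.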

The masking strategy itself is also carefully designed: Sapiens2 uses mixed blockwise/patchwise masking (blockwise probability 0.4) at a 75% mask ratio with patch size 16. At 1024×768 resolution (a 64×48 grid, 3072 patches), this masks 2304 patches per image, which is enough to create coarse occlusions that regularize MAE while preserving sufficient context for the contrastive objective.
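The patch counts above follow directly from the stated resolution, patch size, and mask ratio:

```python
# Sanity check of the stated masking arithmetic.
H, W, patch, mask_ratio = 1024, 768, 16, 0.75

grid_h, grid_w = H // patch, W // patch       # 64 x 48 patch grid
num_patches = grid_h * grid_w                 # 3072 patches
num_masked = round(num_patches * mask_ratio)  # 2304 masked per image

print(grid_h, grid_w, num_patches, num_masked)  # 64 48 3072 2304
```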

For stability at scale, the architecture incorporates several improvements: RMSNorm replacing LayerNorm, Grouped-Query Attention (GQA) in mid-depth blocks for higher throughput, QK-Norm for robust high-resolution training, and SwiGLU feed-forward layers. The decoder uses pixel-shuffle upsampling for sub-pixel reasoning. Decoder output resolution was also increased from 0.5K to 1K for base backbones, and to 2K for 4K backbones.
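Two of these components are compact enough to write out. The following is a minimal numpy sketch of RMSNorm and a SwiGLU gate, not Meta’s implementation; shapes and weights are illustrative.

```python
import numpy as np

def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm rescales features by their root-mean-square, skipping
    LayerNorm's mean subtraction: one fewer reduction per layer, and
    empirically more stable in very large transformers."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu(x: np.ndarray, w_gate: np.ndarray, w_up: np.ndarray) -> np.ndarray:
    """SwiGLU feed-forward gate: SiLU(x @ W_gate) * (x @ W_up)."""
    gate = x @ w_gate
    return (gate / (1.0 + np.exp(-gate))) * (x @ w_up)  # SiLU(gate) * up

x = np.array([[3.0, -4.0]])
out = rms_norm(x, weight=np.ones(2))
print(out)  # RMS of [3, -4] is ~3.5355, so out ≈ [[0.8485, -1.1314]]
```

After RMSNorm with unit weights, the output’s own RMS is 1 by construction, which bounds activation magnitudes regardless of depth or resolution.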

Post-Training: Five Human Tasks, 10× More Supervision

A critical improvement over the original Sapiens is the scale and quality of task-specific supervision. Relative to the first generation, Sapiens2 scales task-specific labels by 10×, typically reaching around 1 million labels per task. After pretraining, five downstream tasks are handled by lightweight task-specific heads trained on top of the shared backbone, which is left unchanged:

  • Pose Estimation: A 308-keypoint full-body skeleton with dense face (243 keypoints) and hand (40 keypoints) coverage. The research team newly annotated 100K in-the-wild images to complement studio capture data, improving generalization significantly.
  • Body-Part Segmentation: 29 semantic classes (extended from 28 by adding eyeglasses), trained with per-pixel weighted cross-entropy combined with Dice loss for sharper boundaries.
  • Pointmap Estimation: Rather than predicting relative depth, Sapiens2 regresses a per-pixel 3D pointmap P̂(u) ∈ ℝ³ in the camera frame — a harder task that requires reasoning about camera intrinsics.
  • Normal Estimation: Per-pixel surface unit normals, decoded using multiple PixelShuffle layers for artifact-free upsampling.
  • Albedo Estimation: Per-pixel diffuse albedo Â(u) ∈ [0,1]³, trained purely on synthetic high-fidelity data and designed to recover true skin tone and clothing color under varying illumination.

Results

The numbers are difficult to argue with. On the 11K-image in-the-wild pose test set, Sapiens2-5B achieves 82.3 mAP compared to 78.3 mAP for Sapiens-2B — a +4 mAP improvement. On body-part segmentation, even the smallest model, Sapiens2-0.4B, scores 79.5 mIoU (+21.3 over Sapiens-2B*), while Sapiens2-5B reaches 82.5 mIoU — a +24.3 mIoU gain over the previous generation’s largest model. The 4K variant, Sapiens2-1B-4K, further pushes segmentation to 81.9 mIoU and 92.0 mAcc, demonstrating the benefit of higher-resolution reasoning.

On surface normal estimation, Sapiens2-0.4B already achieves a mean angular error of 8.63°, outperforming the previous state-of-the-art DAViD-L at 10.73°. The 5B model brings this down further to 6.73°, and the 4K variant reaches 6.98° with a median angular error of just 3.08°.
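The metric quoted here, mean angular error between predicted and ground-truth unit normals, is straightforward to compute. A minimal sketch (the exact evaluation protocol is assumed, not taken from the paper):

```python
import numpy as np

def mean_angular_error_deg(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-pixel angle, in degrees, between predicted and
    ground-truth surface normals after normalizing both to unit length."""
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=-1, keepdims=True)
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

gt = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
pred = np.array([[0.0, 0.1, 1.0], [1.0, 0.0, 0.0]])  # one slightly tilted normal
err = mean_angular_error_deg(pred, gt)
print(round(err, 2))  # atan(0.1) ≈ 5.71° averaged with 0° → 2.86
```

The clip before arccos guards against cosine values drifting a hair outside [-1, 1] from floating-point error, which would otherwise produce NaNs on perfectly matched normals.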

For albedo estimation, Sapiens2-5B achieves an MAE of 0.012 and a PSNR of 32.61 dB, with consistent improvement across all model sizes. On pointmap estimation, all Sapiens2 model sizes outperform MoGe, which was previously state-of-the-art for monocular geometry estimation.

In dense-probing evaluations, where the backbone is frozen and only lightweight decoders are trained with identical hyperparameters, Sapiens2-5B surpasses all baselines across every task, including DINOv3-7B (6.71B parameters), despite Sapiens2 being a human-specialist model evaluated against a general-purpose backbone roughly 1.3× its size.


