Despite my own promises, here is a hybrid follow-up to both the [BOORU_CHARS datasets](https://nyaa.si/view/1740396)
and the [safebooru-centric composite rips](https://nyaa.si/view/1733499).
This time the main source is **danbooru** (safe+questionable, ID interval **6640000..8200000 = 31.08.2023..24.09.2024**),
with the best of furry-related e621 and loli-enabled gelbooru for the same interval used as an add-on.
Similar to the rips:
- images initially filtered by Mpixels>=0.48, shorter_side>=600 px, file size>=60000 bytes, no animations;
  strips dropped or adjusted to aspect ratio 0.4..2.1
- PNG/WEBP/AVIF converted to JPG using cjpegli at 96% quality (2000000 bytes limit);
  modest resampling done to a typical longer side of 2560 px (landscape), 1920 px (1x1), 2480 px (portrait)
- verbose file naming used: **"%website% - %id% - %up_to_3_copyrights% ~ %up_to_5_characters% (%up_to_2_artists%).%ext%"**;
  files are uniquely identified by "%website%+%id%"
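
The naming scheme above can be split back into its fields with a regular expression. What follows is a minimal Python sketch, not the tooling used for this release: the function name `parse_bc_name` and the sample file name are made up for illustration. The separator between multiple copyrights, characters or artists is not stated here, so those fields are returned as raw strings.

```python
import re

# Pattern for "%website% - %id% - %copyrights% ~ %characters% (%artists%).%ext%".
# Assumption: the artist group is the last parenthesised part and contains no
# parentheses itself; character names are allowed to contain them.
NAME_RE = re.compile(
    r"^(?P<website>\S+) - (?P<id>\d+) - (?P<copyrights>.*?) ~ "
    r"(?P<characters>.*) \((?P<artists>[^()]*)\)\.(?P<ext>\w+)$"
)


def parse_bc_name(filename):
    """Return the name fields as a dict, or None if the name does not match."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    fields = match.groupdict()
    fields["id"] = int(fields["id"])
    # Files are uniquely identified by "%website%+%id%".
    fields["key"] = f"{fields['website']}+{fields['id']}"
    return fields


# Hypothetical file name, for illustration only.
sample = "danbooru - 7000001 - sousou no frieren ~ frieren (some artist).jpg"
info = parse_bc_name(sample)
print(info["key"], "|", info["copyrights"], "|", info["characters"])
```

Names that do not follow the scheme yield `None`, so the function can be used to skip foreign files in a folder scan.
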
Similar to the BOORU CHARS datasets, extensive processing was done and used for content sorting:
- some general image statistics collected with ExifTool and ImageMagick
- content analysis mostly the same as for BC2023, with up-to-date software and models
- [CRAFT text detector](https://github.com/fcakyon/craft-text-detector) used to estimate total size and number of text pieces
- torso components detected with a [custom PyTorch model](https://github.com/aperveyev/booru_yolo/tree/main/models)
  built over [Ultralytics YOLOv11](https://github.com/ultralytics/ultralytics)
- imageboard tags arranged and partially placed inside image EXIF-info
- clustering implemented both
  - by aspect ratio { 7x10 +/-4% ; 3x4 +/-10% ; 1x1 +/-20% ; 3x2 +/-40% ; 2x3 +/-40% }
  - by head count { 0 heads = letter A, 2 heads = B, 3-5 heads = C, 6+ heads = D, 1 head = letter E }
- sorting inside a cluster is based on an "attractiveness score function" == "colorful and textless"
- a balanced folder/zip typically contains ~1000-2600 files
- the lowest-rated images tend to be manga-like and were manually reviewed
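
The two clustering rules above can be sketched in Python. This is only one possible reading, not the actual sorting code: it assumes the aspect ratio is width/height, that each tolerance is relative to its target ratio, and that the clusters are tested in the listed order with the first match winning (the ranges overlap otherwise). Under this reading the 2x3 and 3x2 clusters together span the 0.4..2.1 range. The function names are made up.

```python
# (name, target width/height ratio, relative tolerance), in the listed order.
ASPECT_CLUSTERS = [
    ("7x10", 7 / 10, 0.04),
    ("3x4", 3 / 4, 0.10),
    ("1x1", 1.0, 0.20),
    ("3x2", 3 / 2, 0.40),
    ("2x3", 2 / 3, 0.40),
]


def aspect_cluster(width, height):
    """Return the first cluster whose tolerance band contains width/height."""
    ratio = width / height
    for name, target, tolerance in ASPECT_CLUSTERS:
        if abs(ratio - target) <= target * tolerance:
            return name
    return None  # outside every band


def head_cluster(heads):
    """Map a detected head count to the cluster letter from the list above."""
    if heads < 0:
        raise ValueError("head count cannot be negative")
    if heads == 0:
        return "A"
    if heads == 1:
        return "E"
    if heads == 2:
        return "B"
    if heads <= 5:
        return "C"
    return "D"


for size in [(700, 1000), (750, 1000), (1000, 1000), (1920, 1080), (500, 1000)]:
    print(size, aspect_cluster(*size))
print([head_cluster(n) for n in (0, 1, 2, 4, 6)])
```
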
Content is a little less processed and a little more NSFW than in the predecessors; nevertheless:
- real-life photos, character-free landscapes, food and macro shots were thrown away
- most comics and N-koma, text-heavy images and line art were filtered out
- too "questionable" images (uncensored nipples or vulva, obvious adult actions) were excluded;
  a separate BOORU BOOBS release is planned
- some background crops, gamma correction, rotation, denoising and other nontrivial improvements were applied
Images were deduplicated with AntiDupl.NET (2% threshold), together with BOORU CHARS 2023 and 2022.
Besides images, the release contains tab-separated text files:
- **BC_2024.tsv** file/image related metadata, 1,260,629 rows
- **BC_2024_tags.tsv** tag list with Danbooru enrichment, 49,041,220 rows
- **BC_2024_yolo.tsv** detailed results of torso components detection, 4,431,887 rows

and also a dedicated "readme" describing their structure.
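
With tens of millions of rows, the tag file is easier to stream than to load at once. Below is a minimal Python sketch of such a reader. It assumes UTF-8 encoding and makes no assumption about a header row or column order (see the readme for the real structure). The helper name `iter_tsv` and the three-column demo rows are invented stand-ins.

```python
import csv
import os
import tempfile


def iter_tsv(path, encoding="utf-8"):
    """Yield each row of a tab-separated file as a list of strings."""
    with open(path, newline="", encoding=encoding) as handle:
        # QUOTE_NONE keeps quote characters inside fields as plain text.
        yield from csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


# Tiny stand-in file so the sketch runs without the real data;
# the column layout here is invented.
with tempfile.NamedTemporaryFile(
    "w", suffix=".tsv", delete=False, encoding="utf-8", newline=""
) as tmp:
    tmp.write("danbooru\t7000001\t1girl\n")
    tmp.write('danbooru\t7000001\t"quoted" tag\n')
    demo_path = tmp.name

rows = list(iter_tsv(demo_path))
os.remove(demo_path)
print(len(rows), rows[1])
```

For the real files, iterate over `iter_tsv(...)` directly instead of building a list, so that only one row is held in memory at a time.
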
**Keep in mind that this release is, first of all,
a dataset of character-centric art in an efficient local format suited for batch processing,
and then
a representative catalog of anime/game/cartoon copyrights, characters and artists for visual estimation,
but not
a complete, maximum-quality rip.**
Some tips on use cases:
```
@REM -- explore torrent
for %%F in ("d:\torr\BOORU_CHARS_2024\2024-3x4\*.zip") do 7z x -r -o"C:\TEMP\" "%%F" *sousou*frieren*stark*
@REM -- much more effective if unzipped
xcopy /s "A:\BCA\*sousou*frieren*stark*" C:\TEMP\
-- and becomes more sophisticated with a database (SQL)
select 'xcopy "'||bc.fpath||'\'||bc.fname||'" C:\TEMP\' xcpy
from bc
join bc_dt d on d.booru=bc.booru and d.fid=bc.fid
where bc.fname like '%dungeon%meshi%senshi%' and d.tag='pantyshot' -- brutal dwarf fanservice
```