<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Llm on Rusty Bower</title><link>https://www.rustybower.com/tags/llm/</link><description>Recent content in Llm on Rusty Bower</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Mon, 23 Feb 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://www.rustybower.com/tags/llm/index.xml" rel="self" type="application/rss+xml"/><item><title>Fixing 10,000 Upside-Down Scanned Slides with a Local Vision LLM</title><link>https://www.rustybower.com/posts/fixing-upside-down-scanned-slides-vision-llm/</link><pubDate>Mon, 23 Feb 2026 00:00:00 +0000</pubDate><guid>https://www.rustybower.com/posts/fixing-upside-down-scanned-slides-vision-llm/</guid><description>&lt;p&gt;My grandfather had about 10,000 35mm slides. I rented a &lt;a class="link" href="https://www.slidesnap.com/" target="_blank" rel="noopener"
 &gt;SlideSnap X1&lt;/a&gt; and spent a weekend feeding them through — 33 boxes&amp;rsquo; worth, organized into folders by box. The scanner itself was great, but when you&amp;rsquo;re pushing through thousands of slides in a weekend, some inevitably go in upside down or backwards. No metadata, no EXIF orientation flags — just thousands of JPEGs, some right-side up, some not, sitting on a NAS.&lt;/p&gt;
&lt;p&gt;Manually reviewing 10,000 images isn&amp;rsquo;t realistic. But a vision LLM running on local hardware can look at each one and answer a simple question: is this upside down?&lt;/p&gt;
&lt;h2 id="the-problem"&gt;The problem
&lt;/h2&gt;&lt;p&gt;A scanned slide can end up in any of four orientations:&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Orientation&lt;/th&gt;
 &lt;th&gt;What happened&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Correct&lt;/td&gt;
 &lt;td&gt;Slide was loaded properly&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Upside down (180°)&lt;/td&gt;
 &lt;td&gt;Slide was inserted flipped vertically&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Mirrored&lt;/td&gt;
 &lt;td&gt;Slide was scanned from the emulsion side&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Mirrored + upside down&lt;/td&gt;
 &lt;td&gt;Both problems at once&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Upside-down images are by far the most common issue. Mirror flips are harder to detect without readable text in the image, so I focused on rotation first.&lt;/p&gt;
&lt;h2 id="why-not-digikam-or-opencv"&gt;Why not digiKam or OpenCV?
&lt;/h2&gt;&lt;p&gt;I tried &lt;a class="link" href="https://www.digikam.org/" target="_blank" rel="noopener"
 &gt;digiKam&lt;/a&gt; first. It has an auto-rotate feature that uses DNN-based orientation detection, and it kind of works — but it felt painfully laggy when processing thousands of images, and the accuracy wasn&amp;rsquo;t great. It would confidently leave obviously upside-down photos untouched. For a few hundred photos it&amp;rsquo;s probably fine, but at this scale I needed something I could script, run unattended, and verify afterwards.&lt;/p&gt;
&lt;p&gt;Pure classical computer vision doesn&amp;rsquo;t help much either. You could try detecting faces and checking if they&amp;rsquo;re inverted, but many of these photos are landscapes, buildings, or candid shots without clear faces. Edge detection and gradient analysis can suggest a dominant &amp;ldquo;up&amp;rdquo; direction, but they&amp;rsquo;re unreliable on photos with ambiguous composition — a snow-covered mountain reflected in a lake, for example.&lt;/p&gt;
&lt;p&gt;The task requires the kind of semantic understanding that a human uses: sky goes up, people stand on their feet, buildings point upward, text reads left-to-right. A vision language model does exactly this.&lt;/p&gt;
&lt;h2 id="the-setup"&gt;The setup
&lt;/h2&gt;&lt;p&gt;I already had &lt;a class="link" href="https://ollama.ai" target="_blank" rel="noopener"
 &gt;Ollama&lt;/a&gt; running on a Mac Studio (M1 Ultra, 64 GB RAM) for &lt;a class="link" href="https://frigate.video" target="_blank" rel="noopener"
 &gt;Frigate&amp;rsquo;s&lt;/a&gt; camera event descriptions. The Mac Studio sits on the same network as the NAS, so the pipeline is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Read image from NAS (SMB mount)&lt;/li&gt;
&lt;li&gt;Send to Ollama&amp;rsquo;s vision model via HTTP API&lt;/li&gt;
&lt;li&gt;Parse the response&lt;/li&gt;
&lt;li&gt;Record the result&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For the model, I used &lt;code&gt;llama3.2-vision&lt;/code&gt; (11B parameters). It fits comfortably in 64 GB of RAM and processes each image in about 15 seconds. The smaller &lt;code&gt;minicpm-v&lt;/code&gt; was faster but significantly less accurate.&lt;/p&gt;
&lt;h2 id="what-didnt-work-the-grid-approach"&gt;What didn&amp;rsquo;t work: the grid approach
&lt;/h2&gt;&lt;p&gt;My first attempt was clever but wrong. I created a 2x2 grid showing all four possible orientations of each image (original, mirrored, rotated 180°, mirrored+rotated), labeled A through D, and asked the model to pick which one looked correct.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;┌─────────┬─────────┐
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ A: orig │ B: mirr │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;├─────────┼─────────┤
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ C: 180° │ D: both │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;└─────────┴─────────┘
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The model said &amp;ldquo;A&amp;rdquo; for every single image. The grid made each sub-image too small to reason about, and the model defaulted to the first option. Five for five wrong.&lt;/p&gt;
&lt;h2 id="what-worked-one-question-at-a-time"&gt;What worked: one question at a time
&lt;/h2&gt;&lt;p&gt;Splitting the problem into a simple binary question on the full-resolution image worked much better. The prompt:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Look at this scanned photograph. Is it UPSIDE DOWN?
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Check these things:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- Are people&amp;#39;s heads at the BOTTOM of the image? That means upside down.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- Is the sky or ceiling at the BOTTOM? That means upside down.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- Are buildings, trees, or poles pointing DOWNWARD? That means upside down.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- Is the ground or floor at the TOP of the image? That means upside down.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Respond in EXACTLY this format (two lines, nothing else):
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;UPSIDE_DOWN: YES or NO
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;REASON: Brief explanation
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;On a test set of 5 images (3 upside-down, 2 correct), this got all 5 right. The key insight: give the model the full image at a reasonable resolution, ask one simple question, and tell it exactly what format to respond in.&lt;/p&gt;
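&lt;p&gt;Parsing the reply stays deliberately boring. A defensive sketch of a &lt;code&gt;parse_response&lt;/code&gt; helper; the fail-closed default, leaving the image alone when the reply is malformed, is my assumption:&lt;/p&gt;

```python
def parse_response(text):
    """Extract the verdict from the model's two-line reply.

    Returns True only on an explicit 'UPSIDE_DOWN: YES'; anything
    malformed defaults to False so a confused reply never flips an image.
    """
    for line in text.splitlines():
        line = line.strip().upper()
        if line.startswith("UPSIDE_DOWN:"):
            return "YES" in line.split(":", 1)[1]
    return False
```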
&lt;h2 id="blur-detection-for-free"&gt;Blur detection for free
&lt;/h2&gt;&lt;p&gt;While I was processing each image, I also wanted to flag blurry scans that might need to be re-done. This part doesn&amp;rsquo;t need an LLM at all — the &lt;a class="link" href="https://docs.opencv.org/4.x/d5/db5/tutorial_laplacian.html" target="_blank" rel="noopener"
 &gt;Laplacian variance&lt;/a&gt; method works well:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_blur_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IMREAD_GRAYSCALE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# Normalize for different scan resolutions&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;scale&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Laplacian&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CV_64F&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The Laplacian operator approximates the second derivative of the image — sharp edges produce high values, blurry regions produce low values. The variance of the Laplacian across the whole image gives a single sharpness score. Higher is sharper.&lt;/p&gt;
&lt;p&gt;A threshold of 100 worked well for these scans. Anything below that was visibly soft.&lt;/p&gt;
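&lt;p&gt;Turning the scores into a re-scan list is then a one-line filter. The dict shape here is illustrative; the real script carries the score in its per-image result record:&lt;/p&gt;

```python
def flag_blurry(blur_scores, threshold=100.0):
    """Return the paths whose Laplacian variance falls below the threshold."""
    return sorted(path for path, score in blur_scores.items() if threshold > score)
```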
&lt;h2 id="sampling-before-committing"&gt;Sampling before committing
&lt;/h2&gt;&lt;p&gt;Running 10,000 images through a vision model at 15 seconds each would take about 42 hours. Before committing to that, I sampled 3 random images from each of the 33 folders to estimate which ones had problems:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Folder Total Sampled Upside↓ %
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;------------------------- ------ -------- -------- ------
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Batch 01 81 3 2 67% &amp;lt;&amp;lt;&amp;lt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Batch 02 70 3 2 67% &amp;lt;&amp;lt;&amp;lt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Batch 03 79 3 2 67% &amp;lt;&amp;lt;&amp;lt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Batch 04 81 3 2 67% &amp;lt;&amp;lt;&amp;lt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Batch 05 78 3 1 33%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Batch 06 541 3 1 33%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Batch 07 441 3 1 33%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Batch 08 420 3 0 0%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Batch 09 1305 3 0 0%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Batch 10 1085 3 0 0%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Batch 11 1373 3 0 0%
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;...
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;99 sample images, 13 minutes, and I had a priority map. Four folders had 67% upside-down rates — those ~311 images should be processed first. The large folders (4,000+ images) looked clean. The sampling pass likely saved 30+ hours of unnecessary processing.&lt;/p&gt;
&lt;h2 id="the-script"&gt;The script
&lt;/h2&gt;&lt;p&gt;The full script is about 400 lines of Python. The core loop is straightforward:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;img_path&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;image_files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# Blur score (instant, no LLM needed)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;blur_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;calculate_blur_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# Orientation check (sends image to Ollama)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;img_b64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;image_to_b64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ask_vision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ollama_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UPSIDE_DOWN_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;img_b64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;is_upside_down&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parse_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ImageResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img_path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;blur_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;blur_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;orientation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;C&amp;#34;&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_upside_down&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;A&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;correction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;rotate_180&amp;#34;&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_upside_down&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;none&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;It generates an HTML report with side-by-side thumbnails (as-scanned vs. corrected) so you can visually verify before applying fixes. The fixes themselves are non-destructive — each corrected image gets a &lt;code&gt;.original&lt;/code&gt; backup:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Review first&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;python slide_analyzer.py &lt;span class="s2"&gt;&amp;#34;/Volumes/slides/batch-01&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Apply corrections from the report&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;python slide_analyzer.py --fix-from slide_report.json
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="performance"&gt;Performance
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Metric&lt;/th&gt;
 &lt;th&gt;Value&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Model&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;llama3.2-vision&lt;/code&gt; (11B)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Hardware&lt;/td&gt;
 &lt;td&gt;Mac Studio M1 Ultra, 64 GB&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Time per image&lt;/td&gt;
 &lt;td&gt;~15 seconds&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Accuracy (test set)&lt;/td&gt;
 &lt;td&gt;5/5&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Sample pass (99 images)&lt;/td&gt;
 &lt;td&gt;13 minutes&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Estimated full run (10,000 images)&lt;/td&gt;
 &lt;td&gt;~42 hours&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Estimated full run (priority folders only)&lt;/td&gt;
 &lt;td&gt;~1.3 hours&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id="what-id-do-differently"&gt;What I&amp;rsquo;d do differently
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Mirror detection is hard.&lt;/strong&gt; I initially tried to detect horizontal flips too, but the model was unreliable — it flagged correctly-oriented images as mirrored. Without readable text in the photo, there often aren&amp;rsquo;t enough visual cues. Mirror detection probably needs a more capable model or a different approach (OCR on the image, then check if the text is backwards).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A larger model might help.&lt;/strong&gt; The 11B &lt;code&gt;llama3.2-vision&lt;/code&gt; worked well for upside-down detection, which is a relatively easy spatial reasoning task. Mirror detection is subtler and might benefit from a 34B+ parameter model. With 64 GB of RAM, &lt;code&gt;llava:34b&lt;/code&gt; would fit, but I haven&amp;rsquo;t tested it yet.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Parallelize the Ollama calls.&lt;/strong&gt; The current script processes images sequentially because Ollama handles one inference at a time on a single GPU. With multiple GPUs or a cloud API, the requests could be fanned out concurrently and the wall-clock time cut dramatically.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;The combination of &amp;ldquo;cheap classical CV for the easy stuff&amp;rdquo; (blur detection) and &amp;ldquo;local vision LLM for the semantic stuff&amp;rdquo; (orientation detection) turned a multi-day manual review into something that runs overnight. The sampling strategy — check a few images per folder before processing everything — is the kind of optimization that seems obvious in retrospect but saves enormous amounts of time.&lt;/p&gt;
&lt;p&gt;The script, the Ollama model, and the NAS are all local. No images leave the network, no API costs, no rate limits. That matters when the photos are your grandfather&amp;rsquo;s personal history.&lt;/p&gt;</description></item></channel></rss>