bennettdixon 18 hours ago [-]
Nice write-up. One thing that stood out is the V2 to V3 jump. One of my clients is integrating personal wellness and AI, and we took a slightly different route. The health data and personal data live in separate databases with an encrypted mapping layer between them. This way the model only sees health context attached to a unique, pseudonymous user-level session. Your problem almost seems harder, because the PII is the signal/context. One challenge we are facing is re-identification, e.g. rich health profiles being identifiable in themselves.
Curious if you have thought about that side of things with your V3 implementation?
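For readers who want to picture the split described above, here is a minimal sketch, not the commenter's actual implementation. It assumes a keyed HMAC as a stand-in for the "encrypted mapping layer", and the names `PseudonymMapper`, `identity_db`, `health_db`, `store`, and `model_context` are all hypothetical.

```python
import hashlib
import hmac


class PseudonymMapper:
    """Derives a stable pseudonymous ID from a real user ID with a secret key.

    Assumption: HMAC-SHA256 stands in for the encrypted mapping layer.
    """

    def __init__(self, key: bytes) -> None:
        self._key = key

    def pseudo_id(self, user_id: str) -> str:
        digest = hmac.new(self._key, user_id.encode(), hashlib.sha256)
        return digest.hexdigest()[:16]


# Two separate stores (plain dicts here, separate databases in practice).
identity_db: dict[str, dict] = {}  # real user ID -> personal data
health_db: dict[str, dict] = {}    # pseudonymous ID -> health data


def store(mapper: PseudonymMapper, user_id: str, pii: dict, health: dict) -> None:
    identity_db[user_id] = pii
    health_db[mapper.pseudo_id(user_id)] = health


def model_context(mapper: PseudonymMapper, user_id: str) -> dict:
    """Builds what the model sees: a pseudonymous session plus health data."""
    pid = mapper.pseudo_id(user_id)
    return {"session": pid, "health": health_db[pid]}
```

Note that this only hides the direct identifier. As the comment points out, a rich enough health payload can still be identifying on its own.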
n00pn00p 18 hours ago [-]
That's a great point. Because my tool is designed for security operations and triage, the context (like knowing an IP is from Hetzner, or a domain is a known burner) is actually the signal the LLM needs to do its job. I made a conscious trade-off to allow some contextual metadata to pass through to preserve utility.
Since I'm based in the Netherlands, I look at this strictly through the lens of the Dutch privacy law (the AVG). Under the AVG, there's a hard line between anonymized data and pseudonymized data. Because of the exact 'mosaic effect' you mentioned, pseudonymized data is legally still treated as personal data. So, the re-identification risk is an accepted reality.
Essentially, I treat the tool as an extra effort to reduce PII leaks. But it's not foolproof against context clues.
_zer0c00l_ 2 days ago [-]
I have (at least) one fundamental concern about the approach. Let's say I'm building an anti-fraud system that uses AI (through an API), and maybe I'm asking the AI whether my user totally+fraud@gmail.com is a potential fraudster. By masking this email address I'm sabotaging my own prompt: the AI can no longer reason from the facts that 1) the email is on a free public provider and 2) the address says 'fraud' right in your face.
n00pn00p 2 days ago [-]
Valid point. The proxy has the option to always allow domain names through. I fear you will always lose some context, though. It should be used sparingly, when you need a frontier model but also want to send sensitive data.
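To illustrate the trade-off, here is a rough sketch of masking the local part of an email while letting the domain through. It is not the proxy's actual code: the function `mask_emails`, its `keep_domain` flag, the regex, and the token format are all assumptions made for this example.

```python
import hashlib
import re

# Deliberately simple email pattern; real-world addresses are messier.
EMAIL_RE = re.compile(r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")


def mask_emails(text: str, keep_domain: bool = True) -> str:
    """Replaces each email with a stable token, optionally keeping the domain."""

    def _replace(match: re.Match) -> str:
        domain = match.group(2)
        # Hash the whole address so the same email always maps to the same token.
        token = hashlib.sha256(match.group(0).lower().encode()).hexdigest()[:8]
        if keep_domain:
            return f"user_{token}@{domain}"
        return f"email_{token}"

    return EMAIL_RE.sub(_replace, text)


masked = mask_emails("Is totally+fraud@gmail.com a potential fraudster?")
```

With `keep_domain=True` the model can still see that the address is on a free public provider, but the 'fraud' hint in the local part is gone, which is exactly the loss of context discussed above.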
stuaxo 2 days ago [-]
You can do those as a separate prompt.
dwa3592 1 day ago [-]
Ooh, nice. I built something very similar last year.
- https://github.com/deepanwadhwa/semi_private_chat
- https://github.com/deepanwadhwa/zink