GitHub - DataFog/datafog-python: Privacy Engineering for the Generative AI era

Open-source DevSecOps for Generative AI Systems.

Overview

DataFog is an open-source DevSecOps platform that lets you scan for and redact Personally Identifiable Information (PII) in your Generative AI applications.

Installation

DataFog can be installed via pip:

pip install datafog

Getting Started

To use DataFog, create a DataFog client and specify the operation it should perform. Here's a basic setup:

from datafog import DataFog

# For text annotation
client = DataFog(operations="annotate_pii")

# For OCR (Optical Character Recognition)
ocr_client = DataFog(operations="extract_text")

Text PII Annotation

Here's an example of how to annotate PII in a text document:

import requests

# Fetch sample medical record
doc_url = "https://gist.githubusercontent.com/sidmohan0/b43b72693226422bac5f083c941ecfdb/raw/b819affb51796204d59987893f89dee18428ed5d/note1.txt"
response = requests.get(doc_url)
text_lines = [line for line in response.text.splitlines() if line.strip()]

# Run annotation
annotations = client.run_text_pipeline_sync(str_list=text_lines)
print(annotations)
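
The example above depends on a network fetch. As a minimal sketch for text that is already in memory, the same preprocessing and pipeline call can be wrapped in two small helpers. The names `non_empty_lines` and `annotate_text` are hypothetical and not part of DataFog. The only DataFog call used is `run_text_pipeline_sync(str_list=...)`, as shown above, and its return value is passed through unchanged.

```python
from typing import Any, List


def non_empty_lines(text: str) -> List[str]:
    """Split text into lines and drop blank or whitespace-only ones."""
    return [line for line in text.splitlines() if line.strip()]


def annotate_text(client: Any, text: str) -> Any:
    """Run the synchronous text pipeline on the non-empty lines of `text`.

    `client` is assumed to be a DataFog client created with
    operations="annotate_pii", as in the Getting Started section.
    """
    return client.run_text_pipeline_sync(str_list=non_empty_lines(text))
```

With the client from Getting Started, this could be called as `annotate_text(client, open("note1.txt").read())`, where `note1.txt` is a placeholder for a local file.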

OCR PII Annotation

To run the OCR pipeline on an image, use the OCR client created in Getting Started:

import asyncio
import nest_asyncio

nest_asyncio.apply()


async def run_ocr_pipeline_demo():
    image_url = "https://s3.amazonaws.com/thumbnails.venngage.com/template/dc377004-1c2d-49f2-8ddf-d63f11c8d9c2.png"
    results = await ocr_client.run_ocr_pipeline(image_urls=[image_url])
    print("OCR Pipeline Results:", results)


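# asyncio.run starts an event loop, runs the coroutine to completion, and closes the loop.
# nest_asyncio.apply() above lets this also work where a loop is already running (e.g. Jupyter).
asyncio.run(run_ocr_pipeline_demo())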

Note: The OCR pipeline is asynchronous. run_ocr_pipeline is a coroutine, so call it with await from inside an async function, as in the example above.
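
For scripts that are otherwise synchronous, the coroutine can be hidden behind a small blocking wrapper. This is a minimal sketch: `run_ocr_sync` is a hypothetical helper, not part of DataFog, and it assumes only the `run_ocr_pipeline(image_urls=...)` coroutine shown above.

```python
import asyncio
from typing import Any, List


def run_ocr_sync(client: Any, image_urls: List[str]) -> Any:
    """Run the OCR pipeline coroutine to completion and return its result.

    `client` is assumed to be a DataFog client created with
    operations="extract_text", as in the Getting Started section.
    """
    return asyncio.run(client.run_ocr_pipeline(image_urls=image_urls))
```

Because `asyncio.run` cannot be called from inside an already running event loop, this wrapper suits plain scripts. In a notebook, either await the coroutine directly or apply nest_asyncio first, as in the example above.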

Examples

For more detailed examples, check out our Jupyter notebooks in the examples/ directory:

- text_annotation_example.ipynb: Demonstrates text PII annotation
- image_processing.ipynb: Shows OCR capabilities and text extraction from images

These notebooks provide step-by-step guides on how to use DataFog for various tasks.

Dev Notes

For local development:

Clone the repository.
Navigate to the project directory:
```
cd datafog-python
```
Create a new virtual environment (using .venv is recommended as it is hardcoded in the justfile):
```
python -m venv .venv
```
Activate the virtual environment:
- On Windows:
```
.venv\Scripts\activate
```
- On macOS/Linux:
```
source .venv/bin/activate
```
Install the package in editable mode:
```
pip install -e .
```
Set up the project:
```
just setup
```

Now, you can develop and run the project locally.

Important Actions:

Format the code:
```
just format
```
This runs isort to sort imports.
Lint the code:
```
just lint
```
This runs flake8 to check for linting errors.
Generate coverage report:
```
just coverage-html
```
This runs pytest and generates a coverage report in the htmlcov/ directory.

We use pre-commit to run checks locally before committing changes. Once pre-commit is installed, you can run every hook against the whole repository with:

pre-commit run --all-files

Dependencies

For OCR, we use Tesseract, which is installed as part of the build step. You can find the relevant configurations under .github/workflows/ in the following files:

- dev-cicd.yml
- feature-cicd.yml
- main-cicd.yml
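
The workflow files cover CI only, so a local machine may not have Tesseract available. The following is a minimal sketch of a pre-flight check, assuming the OCR path needs the `tesseract` command-line executable on the PATH (an assumption; the README does not say how Tesseract is invoked). `find_tesseract` is a hypothetical helper, not part of DataFog.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the `tesseract` executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found on PATH; OCR may not work locally")
    else:
        print("Using tesseract at", path)
```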

Testing

Tests are run against Python 3.10.

License

This software is published under the MIT license.

