Portfolio

A selection of recent work across data engineering, AI-powered document automation, and cloud-based ETL development.

1. Inspection Report Automation (OCR + Structured Data Extraction)

Industry: Property / Facility Management

Documents: Building inspection reports, routine inspections, defect reports

Technologies: Azure Document Intelligence, OCR, Layout analysis, Python, Azure OpenAI

Overview

Developed an end-to-end workflow to convert inspection report PDFs into structured JSON/Excel outputs.

The workflow performs OCR, layout parsing, table extraction, field normalisation, defect identification, and summary generation.
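
As an illustration of the field normalisation step, the sketch below maps raw OCR label/value pairs onto a canonical schema. The alias table, field names, and date formats are hypothetical and are not taken from any client report format.

```python
from datetime import datetime

# Hypothetical alias table: raw OCR labels -> canonical field names.
FIELD_ALIASES = {
    "inspection date": "inspection_date",
    "date of inspection": "inspection_date",
    "inspector": "inspector_name",
    "inspected by": "inspector_name",
    "property address": "address",
}

# Assumed date layouts; real reports may use others.
DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d", "%d %B %Y")


def normalise_date(raw):
    """Return an ISO 8601 date string, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalise_fields(raw_fields):
    """Map raw label/value pairs to canonical names and cleaned values."""
    result = {}
    for label, value in raw_fields.items():
        key = FIELD_ALIASES.get(label.strip().rstrip(":").lower())
        if key is None:
            continue  # unknown labels are skipped in this sketch
        value = " ".join(value.split())  # collapse whitespace left by OCR
        if key.endswith("_date"):
            value = normalise_date(value)
        result[key] = value
    return result
```

A production workflow would likely route unknown labels and unparseable dates to a review queue instead of dropping them.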

Outcome / Impact

  • Reduced manual review time from 20–40 minutes to 1–2 minutes per report
  • Achieved 83–92% extraction accuracy after tuning against client formats
  • Enabled automatic defect summaries and action lists
  • Integrated 10,000+ historical reports into a searchable knowledge base

2. Contract & Policy Document Extraction / Classification

Industry: Corporate / Legal / Real Estate

Documents: Tenancy agreements, supplier contracts, internal policies, compliance documents

Technologies: OCR, Azure OpenAI embeddings, Vector Search, Python

Overview

Built a document ingestion pipeline that extracts key clauses, dates, obligations, risks, and renewal conditions from contract PDFs.

Classification and metadata tagging were added for large document archives.

Outcome / Impact

  • Automated extraction of 20–40 key fields per contract
  • Implemented a standard JSON schema for downstream systems
  • Achieved consistent clause identification across multiple contract formats
  • Enabled instant retrieval of similar clauses using semantic search

3. Routine Property Inspection Summary Generator

Industry: Real Estate Property Management

Documents: Routine inspection reports (property condition, photos, notes)

Technologies: OCR, Python, Image–Text validation, Summarisation models

Overview

Developed a system that combines text from inspectors’ notes with room-by-room photos.

The workflow detects inconsistencies, extracts condition ratings, and generates a structured summary.
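
The inconsistency check can be pictured roughly as follows. This is a minimal sketch that assumes each room's photos have already been tagged by an image model; the rating vocabulary, damage labels, and function names are hypothetical.

```python
import re

# Assumed rating vocabulary, ordered from worst to best.
RATINGS = ("poor", "fair", "good", "excellent")
RATING_PATTERN = re.compile(r"\b(" + "|".join(RATINGS) + r")\b", re.IGNORECASE)

# Hypothetical labels that an image model might emit for visible damage.
DAMAGE_LABELS = {"crack", "mould", "stain", "water damage"}


def extract_rating(note):
    """Return the first rating word found in an inspector's note, or None."""
    match = RATING_PATTERN.search(note)
    return match.group(1).lower() if match else None


def find_inconsistencies(notes_by_room, photo_labels_by_room):
    """Flag rooms rated good or excellent whose photos carry damage labels."""
    flagged = []
    for room, note in notes_by_room.items():
        rating = extract_rating(note)
        damage = DAMAGE_LABELS & set(photo_labels_by_room.get(room, []))
        if rating in ("good", "excellent") and damage:
            flagged.append({"room": room, "rating": rating, "damage": sorted(damage)})
    return flagged
```

Flagged rooms would then be surfaced in the structured summary for a human to confirm.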

Outcome / Impact

  • Standardised reporting quality across inspectors
  • Automated creation of summary sheets for owners/landlords
  • Reduced manual summarisation time by 70–90%
  • Improved consistency in property condition evaluations

4. ETL Pipeline Development for National Telecom (Optus)

Industry: Telecommunications

Technologies: Databricks, PySpark, Airflow, SQL, Delta Lake, AWS

Overview

Participated in the development of large-scale ETL pipelines for nationwide operational data.

Responsibilities included pipeline design, job optimisation, data quality monitoring, and Airflow orchestration.

Outcome / Impact

  • Improved job reliability and reduced processing time
  • Implemented robust data quality checks
  • Handled multi-terabyte distributed processing
  • Ensured production-grade deployment and documentation

5. Supplier Data Warehouse for a Trading & Manufacturing Group (Marubeni)

Industry: Trading / Manufacturing

Technologies: AWS (Lambda, DynamoDB, S3, Glue), Python, SQL

Overview

Designed and implemented an ETL pipeline migrating supplier data from DynamoDB into a structured warehouse schema.

Developed Python modules for data transformation and schema alignment.
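
A transformation module of this kind might look like the sketch below, which converts DynamoDB's typed attribute format (for example `{"S": "text"}` or `{"N": "14"}`) into plain values and renames attributes to warehouse columns. The column mapping and attribute names are hypothetical, and only a subset of DynamoDB type tags is handled.

```python
from decimal import Decimal


def from_dynamodb(attr):
    """Convert one DynamoDB-typed attribute value to a plain Python value."""
    (type_tag, value), = attr.items()
    if type_tag == "S":
        return value
    if type_tag == "N":
        return Decimal(value)  # DynamoDB transmits numbers as strings
    if type_tag == "BOOL":
        return value
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [from_dynamodb(v) for v in value]
    if type_tag == "M":
        return {k: from_dynamodb(v) for k, v in value.items()}
    raise ValueError(f"unsupported DynamoDB type tag: {type_tag}")


# Hypothetical mapping from source attribute names to warehouse columns.
COLUMN_MAP = {
    "supplierId": "supplier_id",
    "name": "supplier_name",
    "leadTimeDays": "lead_time_days",
}


def to_warehouse_row(item):
    """Flatten a DynamoDB item into a row keyed by warehouse column names."""
    plain = {k: from_dynamodb(v) for k, v in item.items()}
    return {col: plain.get(src) for src, col in COLUMN_MAP.items()}
```

Attributes missing from an item come through as `None`, so every row has the same columns regardless of how sparse the source item is.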

Outcome / Impact

  • Automated daily ingestion and transformation
  • Designed a star schema for analytics
  • Enabled downstream Power BI dashboards
  • Improved data consistency across business units

6. BI Integration for ServiceNow and External SaaS

Industry: IT Service / CRM

Technologies: ServiceNow, API Integration, Azure SQL, Data Factory, Python

Overview

Built real-time integrations between ServiceNow and external SaaS platforms.

Performed data cleansing, transformation, and modelling for BI reporting.

Outcome / Impact

  • Reduced manual reconciliation work
  • Delivered real-time operational dashboards
  • Improved reliability of cross-system data

7. Large-Scale Web Scraping & Distributed Processing (DataSection)

Industry: E-commerce / Market Intelligence

Technologies: Python, Java, AWS EC2, Redis, S3, SQS, Distributed crawling

Overview

Designed and built a distributed web scraping architecture that collects data for millions of products.

Implemented proxy pools, retry logic, failover mechanisms, and automated pipelines.
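
Retry logic of this sort is commonly built on exponential backoff with jitter. The sketch below shows that general pattern and is not the production implementation; `fetch_with_retry`, its parameters, and the injectable `sleep` argument are assumptions made for illustration.

```python
import random
import time


def fetch_with_retry(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on exceptions with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Delay doubles each attempt; jitter spreads out concurrent retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

Passing `sleep` as a parameter keeps the function testable without real waiting. A real crawler would typically retry only on transient errors such as timeouts and rotate to a different proxy between attempts.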

Outcome / Impact

  • Scaled to millions of records per day
  • Reduced scraping failure rate with smart retry logic
  • Achieved stable and cost-efficient distributed crawling

8. ML-Based Document Search (Azure + LangChain)

Industry: Professional Services

Technologies: Azure OpenAI, Azure Search, Python, LangChain

Overview

Developed a hybrid search system that combines embeddings-based retrieval, keyword fallback, and metadata filtering.

The system is used for document discovery across multiple departments.
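
The way the three retrieval stages can fit together is sketched below with in-memory data. It assumes document embeddings are precomputed (for example by an embedding model) and uses a hypothetical document shape with `id`, `department`, `vector`, and `text` keys; it does not call Azure OpenAI, Azure Search, or LangChain.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query_vec, query_text, docs, department=None, min_score=0.5, top_k=3):
    """Filter by metadata, rank by embedding similarity, fall back to keywords."""
    candidates = [d for d in docs if department is None or d["department"] == department]
    scored = [(cosine(query_vec, d["vector"]), d) for d in candidates]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    hits = [d for score, d in ranked if score >= min_score]
    if not hits:
        # Keyword fallback: keep documents containing any query term.
        terms = query_text.lower().split()
        hits = [d for d in candidates if any(t in d["text"].lower() for t in terms)]
    return [d["id"] for d in hits[:top_k]]
```

The `min_score` threshold decides when embedding results are considered too weak and the keyword path takes over; its value here is arbitrary.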

Outcome / Impact

  • Improved search precision compared to keyword-only systems
  • Reduced time spent locating relevant documents
  • Supported bilingual content (English/Japanese)

Core Technical Capabilities

  • OCR / layout analysis
  • Document classification and extraction
  • LLM-based summarisation and validation
  • Databricks / PySpark ETL
  • Airflow / orchestration
  • AWS / Azure cloud pipelines
  • API integrations
  • Data modelling and warehouse design

Contact

For detailed case studies or sample outputs, please reach out:

📧 nueki@manoriworks.com

🌐 AI-Powered Document Automation: https://manoriworks.com/en

🔗 LinkedIn