Converting a PDF question catalog to CSV

Garrett

Here to stay

Registered: May 2003
Location: Wien
Posts: 1099

29.01.2024 - 18:19

Hi all, I'm not sure whether I'm in the right forum.

Anyway, I have a multiple-choice question catalog as a PDF with 665 questions. I'd like to turn it into an Anki deck for studying. For that I first need to get it into a CSV.

Column A should contain the question including the answer options, and column B the answer.

If I do it manually with copy & paste, I need about 20 seconds per question, which adds up to almost 4 hours.

Does anyone have a tip (a few lines of code?) for how I could cut that time down?

ecqb-ppla_de_gesamt_2019656-fragen_269029.pdf

hynk

Club member
like totally ambivalent

Registered: Apr 2003
Location: Linz
Posts: 11083

30.01.2024 - 01:10

Attached you'll find the result of an hour of prompt tinkering with ChatGPT.
I limited the output to pages 3 to 22. If the result looks right, I'm happy to do the rest for you tomorrow.

The prompt doesn't work 100% yet, but the approach looks workable to me.

Code:

Help me extract information from this pdf. The pdf contains Multiple Choice Questions and its associated answers, beginning on page 3 up to page 22. 

Output the Number of the question, the question, possible answers and the choices available in a csv, like described below.

Ignore all content that is not recognizable as part of the quiz, like images, graphs, copyright remarks and so on.
Also ignore the headers (example: "10 Luftrecht ECQB-PPL(A)") and footers (example: "v2019.2 21") of the pdf.

There are two kinds of questions. Questions like the first (1) on page 3 and questions like the second (2) on page 3. 

For the first kind use the following approach:
Use the first column for the Questions Number, second column for the question, third column for the possible answers, fourth column for the first choice, fifth column for the second choice, sixth column for the third choice, seventh column for the fourth choice. 

Like in the following example:
***
Number|Question|Options|1st Choice|2nd Choice|3rd Choice|4th Choice
1|Welche dieser Dokumente müssen auf internationalen Flügen immer mitgeführt 
werden?|a) Eintragungsschein b) Lufttüchtigkeitszeugnis c) Bescheinigung über die Nachprüfung der Lufttüchtigkeit d) EASA Form-1 e) Bordbuch f) Entsprechende Ausweise für jedes Besatzungsmitglied g) Technische Lebenslaufakte (1,00 P.)|b, c, d, e, f, g.|d, f, g.|[x] a, b, c, e, f.|a, b, e, g.
***

For the second kind use the following approach
Use the first column for the Questions Number, second column for the question, leave the third column empty, fourth column for the first choice, fifth column for the second choice, sixth column for the third choice, seventh column for the fourth choice. 

Like in the following example
***
Number|Question|Options|1st Choice|2nd Choice|3rd Choice|4th Choice
2|Wie wird ein Gebiet bezeichnet, in welches der Einflug nur mit bestimmten Auflagen 
erlaubt ist? (1,00 P.)|null|Gefahrengebiet|[x]Flugbeschränkungsgebiet|Flugverbotszone|Luftsperrgebiet
***

Make the csv downloadable, use a charset suitable for German, use "|" as a delimiter.

extracted_questions_269031.csv (downloaded 15x)

voyager

kühler versilberer :)

Registered: Nov 2001
Location: Stmk/Austria
Posts: 3848

30.01.2024 - 06:59

Acrobat Pro can export to Excel, but some post-processing will certainly be necessary.

hynk

Club member
like totally ambivalent

Registered: Apr 2003
Location: Linz
Posts: 11083

30.01.2024 - 07:22

Ah, that's a good approach too. Something else that can make your life easier is the classic Windows Snipping Tool, which recently gained OCR.
That does mean, though: take a screenshot of a page, run OCR, paste, clean up.

A DMS with OCR could also help to get the raw text out. But again, the same clean-up applies.

p1perAT

-

Registered: Sep 2009
Location: AT
Posts: 2953

30.01.2024 - 07:28

As an alternative to Acrobat Pro, an export/OCR might also work with PDF24.

DKCH

Administrator
...

Registered: Aug 2002
Location: #
Posts: 3308

30.01.2024 - 07:58

Or open it with LibreOffice and copy the text out there, or save it as text.

COLOSSUS

Administrator
GNUltra

Registered: Dec 2000
Location: ~
Posts: 12148

30.01.2024 - 08:22

I would start with https://github.com/atlanhq/camelot.

Kirby

0x20

Registered: Jun 2017
Location: Lesachtal/Villac..
Posts: 952

30.01.2024 - 09:20

Alternatively, on Linux there is pdftotext.

You will still have to spend ages on post-processing, though, until the format suits you.
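Part of that post-processing can be scripted. The following is only a rough sketch, assuming the text was dumped with something like `pdftotext -layout input.pdf output.txt` and that every page header and footer looks like the two examples quoted in hynk's prompt ("10 Luftrecht ECQB-PPL(A)" and "v2019.2 21"). The function name `stripHeadersAndFooters` and both patterns are made up for illustration and would need checking against the real output.

```javascript
// Patterns modeled on the two examples quoted in hynk's prompt; real pages may differ.
const HEADER = /^\s*\d+\s+.+\sECQB-PPL\(A\)\s*$/; // e.g. "10 Luftrecht ECQB-PPL(A)"
const FOOTER = /^\s*v\d{4}\.\d+\s+\d+\s*$/;       // e.g. "v2019.2 21"

// Hypothetical helper: drop every line that looks like a page header or footer.
function stripHeadersAndFooters(text) {
  return text
    .split(/\r?\n/)
    .filter((line) => !HEADER.test(line) && !FOOTER.test(line))
    .join("\n");
}

// Tiny stand-in for one extracted page (assumed layout, not real pdftotext output).
const page = [
  "10 Luftrecht ECQB-PPL(A)",
  "Wie wird ein Gebiet bezeichnet, in welches der Einflug nur mit bestimmten Auflagen",
  "v2019.2 21",
].join("\n");

console.log(stripHeadersAndFooters(page));
```

Splitting the remaining text into questions and choices is the harder part and depends on how the layout survives the extraction.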

Garrett

Here to stay

Registered: May 2003
Location: Wien
Posts: 1099

30.01.2024 - 09:34

Quote from a post by hynk
Attached you'll find the result of an hour of prompt tinkering with ChatGPT.
I limited the output to pages 3 to 22. If the result looks right, I'm happy to do the rest for you tomorrow.

The prompt doesn't work 100% yet, but the approach looks workable to me.

First of all, thank you very much for the time you invested. But am I seeing this correctly that the correct answer isn't marked now?

Edit: The questions are also available online here, in case that helps.
http://ato.fsv2000.com/fragenkatalog/

Edited by Garrett on 30.01.2024, 09:56

berndy2001

Registered: Feb 2003
Location: Vienna
Posts: 2046

30.01.2024 - 10:26

The page source contains the questions, the possible answers, and the correct answer in a JSON array. It couldn't be any better.

Code:

{
    "top": 0,
    "nr": 2,
    "imgs": [],
    "txt": "Wie wird ein Gebiet bezeichnet, in welches der Einflug nur mit bestimmten Auflagen\nerlaubt ist? (1,00 P.)",
    "corans": 2,
    "ans": [
        "Luftsperrgebiet",
        "Gefahrengebiet",
        "Flugbeschränkungsgebiet",
        "Flugverbotszone"
    ]
}

very quick, very dirty:

result_269035.zip (downloaded 13x)

Edited by berndy2001 on 30.01.2024, 13:17
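For the layout Garrett asked for (column A: question including the answer options, column B: the answer), one record of that JSON could be turned into a CSV line roughly like this. It is a minimal sketch, not the script from the attachment: `toAnkiRow` and `csvField` are hypothetical helpers, `corans` is read as a zero-based index into `ans` (which fits the sample, where index 2 is "Flugbeschränkungsgebiet"), and `<br>` as option separator assumes HTML is allowed in fields on import.

```javascript
// Sample record copied from the post above; the field names come from the site's JSON.
const sample = {
  top: 0,
  nr: 2,
  imgs: [],
  txt: "Wie wird ein Gebiet bezeichnet, in welches der Einflug nur mit bestimmten Auflagen\nerlaubt ist? (1,00 P.)",
  corans: 2,
  ans: ["Luftsperrgebiet", "Gefahrengebiet", "Flugbeschränkungsgebiet", "Flugverbotszone"],
};

const LETTERS = ["A", "B", "C", "D"];

// Quote a field CSV-style: wrap it in double quotes and double any embedded quotes.
function csvField(value) {
  return '"' + String(value).replace(/"/g, '""') + '"';
}

// Hypothetical helper: build one two-column CSV line from a question record.
function toAnkiRow(item, optionSeparator = "<br>") {
  const question = item.txt.replace(/\s*\n\s*/g, " "); // undo the line wrap inside txt
  const options = item.ans.map((answer, i) => `${LETTERS[i]}) ${answer}`);
  const front = [question, ...options].join(optionSeparator);
  const back = `${LETTERS[item.corans]}) ${item.ans[item.corans]}`; // assumes zero-based corans
  return csvField(front) + "," + csvField(back);
}

console.log(toAnkiRow(sample));
```

Mapping `toAnkiRow` over all questions and joining the lines with newlines would give the whole file; a space can replace `<br>` if HTML is not wanted.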

hynk

Club member
like totally ambivalent

Registered: Apr 2003
Location: Linz
Posts: 11083

30.01.2024 - 10:39

You're welcome. I did it out of my own curiosity about how well this works by now.
The correct answer would still have to be coaxed out of GPT and put into a separate column. That's where I went to bed yesterday.

With the online question catalog you've already won, though. Much better material than the PDF.
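If the generated file does keep the "[x]" marker the way the prompt's examples show it, moving the correct answer into its own column would not need another round with GPT. A rough sketch under those assumptions: `splitMarkedAnswer` is a hypothetical helper, each record sits on a single line (wrapped lines would have to be joined first), and no field contains a "|".

```javascript
// Hypothetical helper: split one pipe-delimited line in the prompt's format
// (Number|Question|Options|1st Choice|2nd Choice|3rd Choice|4th Choice)
// and pull out the choice marked with "[x]".
function splitMarkedAnswer(line) {
  const [number, question, options, ...choices] = line.split("|");
  let correct = null;
  const cleaned = choices.map((choice) => {
    const match = choice.match(/^\s*\[x\]\s*(.*)$/i);
    if (match) {
      correct = match[1].trim();
      return correct; // keep the choice, minus the marker
    }
    return choice.trim();
  });
  return {
    number: Number(number),
    question: question.trim(),
    options: options.trim() === "null" ? "" : options.trim(), // "null" marks an empty column
    choices: cleaned,
    correct, // stays null if no choice carried the marker
  };
}

// Second example from the prompt, joined onto one line.
const line =
  "2|Wie wird ein Gebiet bezeichnet, in welches der Einflug nur mit bestimmten Auflagen erlaubt ist? (1,00 P.)|null|Gefahrengebiet|[x]Flugbeschränkungsgebiet|Flugverbotszone|Luftsperrgebiet";

console.log(splitMarkedAnswer(line));
```

A `correct` value of `null` is a cheap way to spot the rows where the marker got lost.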

berndy2001

Registered: Feb 2003
Location: Vienna
Posts: 2046

30.01.2024 - 14:12

nodejs:

Code:

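// Requires the SheetJS package: npm install xlsx
const XLSX = require('xlsx');
// JSON array copied from the page source of the online catalog (elided here)
const data = [{"id":"ppl_..........,"ans":["1630","1330","1430","1230"]}]}]

const workbook = XLSX.utils.book_new();
// One row per question: the question text and the letter of the correct answer
// (corans is used as a zero-based index into the answers)
const worksheet = XLSX.utils.json_to_sheet(data[0].questions.map(item => ({
    Frage: item.txt,
    Antwort: ['A', 'B', 'C', 'D'][item.corans]
})));

XLSX.utils.book_append_sheet(workbook, worksheet, "Fragen und Antworten");

XLSX.writeFile(workbook, "fragen_und_antworten.xlsx");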

2024-01-30-14_11_40-fragen_und_antworten-xlsx-excel_269041.png

Garrett

Here to stay

Registered: May 2003
Location: Wien
Posts: 1099

30.01.2024 - 14:32

Thanks, everyone! In the end I solved it in Excel, based on berndy2001's export. <3

One last question/request: I need all the image files from http://ato.fsv2000.com/fragenkatalog/
Can anyone help me with that?

berndy2001

Registered: Feb 2003
Location: Vienna
Posts: 2046

30.01.2024 - 16:23

Code:

const XLSX = require('xlsx');
const http = require('http'); // use 'https' instead for https:// URLs
const fs = require('fs');

// JSON array copied from the page source of the online catalog (elided here)
const data = [{"id":"ppl_..........,"ans":["1630","1330","1430","1230"]}]}]

const workbook = XLSX.utils.book_new();
const worksheet = XLSX.utils.json_to_sheet(data[0].questions.map(item => ({
    Frage: item.txt,
    Antwort: ['A', 'B', 'C', 'D'][item.corans]
})));

XLSX.utils.book_append_sheet(workbook, worksheet, "Fragen und Antworten");
XLSX.writeFile(workbook, "fragen_und_antworten.xlsx");

// Collect every image file name referenced by a question, without duplicates
const imgs = [...new Set(data[0].questions.flatMap(item => item.imgs))];
const baseUrl = 'http://ato.fsv2000.com/fragenkatalog/imgs/';

for (const img of imgs) {
    console.log(baseUrl + img);
    http.get(baseUrl + img, (response) => {
        if (response.statusCode !== 200) {
            console.error(`Skipping ${img}: HTTP ${response.statusCode}`);
            response.resume(); // discard the body so the socket is released
            return;
        }
        // Save the image under its original file name in the current directory
        const file = fs.createWriteStream(img);
        response.pipe(file);
        file.on('finish', () => {
            file.close();
            console.log(`Download completed: ${img}`);
        });
    }).on('error', (err) => console.error(`Request for ${img} failed: ${err.message}`));
}

Edited by berndy2001 on 30.01.2024, 16:39
