Slovene Module for NooJ -- version 29-12-2014 -- includes:


#####CORPORA#####
--> ssj500k corpus
[ssj500k_v1.2.not]
Manually annotated training corpus developed within the Communication in Slovene project.
Size: 500,295 word tokens
  Description and license: http://eng.slovenscina.eu/tehnologije/ucni-korpus
  TEI Header URL: http://nl.ijs.si/ssj/ssj500k/ssj500k-en.teiHeader.html
  Original file download (TEI P5 XML): http://nl.ijs.si/ssj/ssj500k/ssj500kv1_2.xml.gz
  Annotation: manual >> tokenization/segmentation (all), lemmatization (all), morphosyntactic annotation (all),
  dependency relations (~11,000 sentences), named entities (~100,000)
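
As a rough illustration of how the TEI P5 source could be inspected outside NooJ, the sketch below counts word tokens and collects lemmas from a tiny inline sample. It assumes that words are encoded as TEI `w` elements carrying a `lemma` attribute; the sample string and the helper names are hypothetical, and the real ssj500k file should be checked against its TEI header before relying on this.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace, in ElementTree's "{uri}tag" notation.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical two-word sample; NOT taken from the ssj500k corpus.
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p><s>'
    '<w lemma="telesen">Telesni</w> <w lemma="čuvaj">čuvaj</w><pc>.</pc>'
    "</s></p></body></text></TEI>"
)


def count_word_tokens(root):
    """Count <w> elements (assumed here to be the word tokens)."""
    return sum(1 for _ in root.iter(TEI_NS + "w"))


def collect_lemmas(root):
    """Return the lemma attribute of every <w> element, in document order."""
    return [w.get("lemma") for w in root.iter(TEI_NS + "w")]


root = ET.fromstring(SAMPLE)
print(count_word_tokens(root), collect_lemmas(root))
```

For the full download, the same functions could be applied to the root returned by `ET.parse(gzip.open(path))`, although a file of this size may be better handled with `ET.iterparse`.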


--> Telesni cuvaj corpus
[MihaMazzini_TelesniCuvaj.not]
Unannotated (raw text) novel by Miha Mazzini (2000, 2004) with randomized paragraphs.
Size: 78,367 word tokens
  Official website URL: http://www.telesnicuvaj.com/
  The novel has also been published in English (under the title of Guarding Hanna in 2002 and 2008),
  Polish (Pies, 2008), Serbian (Hanin telohranitelj, 2008) and Italian (Mi chiamavano il cane, 2011).
  wiki-SI: http://sl.wikipedia.org/wiki/Telesni_%C4%8Duvaj_%28roman%29
  wiki-EN: http://en.wikipedia.org/wiki/Guarding_Hanna


--> ccKres corpus
[ccKres_v1.0.noc]
Size: 10,000,532 word tokens
  Description and license: http://eng.slovenscina.eu/korpusi/proste-zbirke
  Original file download (TEI P5 XML): http://www.slovenscina.eu/dat/korpusi/cckresV1_0.zip
  Annotation: automatic >> tokenization/segmentation, lemmatization, morphosyntactic annotation
  (NB: Due to its large size, this corpus is available as a separate package. If you are interested in obtaining a copy of the ccKres corpus,
  please contact kaja.dobrovoljc@trojina.si).


#####DICTIONARIES#####
--> Sloleks dictionary
[Sloleks4NooJ.nod]
Reference morphological dictionary with some minor modifications (see NooJ 2013 Conference paper) and new entries from the Telesni cuvaj corpus.
Size: 2,741,373 word forms (~100,000 lemmas)
  Description and license: http://eng.slovenscina.eu/sloleks/opis
  Original file download (LMF XML): http://eng.slovenscina.eu/sloleks/prenos

--> Dictionary of adverbs
[Sloleks4NooJ-adverbs-v2.dic]
[Sloleks4NooJ-adverbs-v2.nod]
Manually evaluated, corrected and extended list of adverbs from the Sloleks dictionary and the ccKres and ccGigafida reference corpora,
with links to patterns for comparison (Adverbs_graph.nof), i.e. a NooJ dictionary proper.
Size: 7,327 lemmas
  Sloleks description and license: http://eng.slovenscina.eu/sloleks/opis
  Sloleks original file download (LMF XML): http://eng.slovenscina.eu/sloleks/prenos


#####GRAMMARS: inflection and derivation#####
--> Patterns for declension of feminine nouns
[Nf_declension.nof]

--> Patterns for comparison of adverbs
[Adverbs_graph.nof]
[Adverbs_rule.nof]

#####GRAMMARS: morphology#####
--> Gerunds
[Gerunds.nom]
Morphological grammar that identifies Slovene gerunds (nouns),
determines their type (ending) and the verb they derive from.

--> Roman Numerals
[RomanNumerals.nom]
Morphological grammar that identifies Roman numerals (up to 3999) and converts them to Arabic numerals.
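
The conversion described above can be sketched outside NooJ as follows. This is a minimal Python illustration, not the content of RomanNumerals.nom: the regular expression, the function name and the choice to reject lowercase input are assumptions made for the example.

```python
import re

# Strict pattern for Roman numerals from I (1) to MMMCMXCIX (3999).
ROMAN_RE = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)

VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_arabic(token):
    """Return the integer value of a Roman numeral, or None if it is invalid."""
    # The pattern also matches the empty string, so reject that explicitly.
    if not token or not ROMAN_RE.fullmatch(token):
        return None
    total = 0
    for i, ch in enumerate(token):
        value = VALUES[ch]
        # Subtractive notation: a smaller symbol before a larger one is subtracted.
        if i + 1 < len(token) and value < VALUES[token[i + 1]]:
            total -= value
        else:
            total += value
    return total


print(roman_to_arabic("MCMXCIV"))  # 1994
```

Validating against the strict pattern first keeps malformed strings such as IIII or MMMM from being converted, which mirrors the 3999 upper bound stated above.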

--> Debugging Sloleks (adverbs)
[GRAM_1_find new inflected forms of existing.nom]
One of the morphological grammars used for Sloleks evaluation.
Looks for potential inflected forms of adverbs originally marked as non-inflectable.


#####GRAMMARS: syntax#####
--> Numerals
[Numerals.nog]
Syntactic grammar that recognizes numerals written in words and converts them to Arabic numerals.

--> Proper Names
[ProperNames.nog]
Syntactic grammar that recognizes potential proper names from their letter case and the preceding context.

--> Adverbial phrases of time
[AdvP_time.nog]
Syntactic grammar that recognizes adverbial phrases that indicate a certain point in time,
e.g. "v ponedeljek, 13. 12. 2001, točno ob 16:32 po našem času".
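
To give a feel for the numeric parts of such a phrase, the sketch below pulls the date and the clock time out of the example above with two regular expressions. It is only an approximation in Python and covers far less than a syntactic grammar would (no weekday names, prepositions or modifiers); the pattern and function names are hypothetical.

```python
import re

# Day. month. year, with an optional space after each dot (e.g. "13. 12. 2001").
DATE_RE = re.compile(r"\b(\d{1,2})\.\s?(\d{1,2})\.\s?(\d{4})\b")
# 24-hour clock time such as "16:32".
TIME_RE = re.compile(r"\b([01]?\d|2[0-3]):([0-5]\d)\b")


def find_time_points(text):
    """Return (dates, times) found in text as lists of integer tuples."""
    dates = [tuple(int(g) for g in m.groups()) for m in DATE_RE.finditer(text)]
    times = [tuple(int(g) for g in m.groups()) for m in TIME_RE.finditer(text)]
    return dates, times


example = "v ponedeljek, 13. 12. 2001, točno ob 16:32 po našem času"
print(find_time_points(example))
```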

--> Noun phrases
[NP.nog]
Syntactic grammar that recognizes single or coordinated noun phrases.

--> Prepositional phrases
[PP.nog]
Syntactic grammar that recognizes prepositional phrases.


#####PDF#####
[_properties.def]
Properties definition file containing grammatical properties
as defined within the MTE-JOS tagset. Properties that are not part
of the dictionary and only occur in morphological grammars have also been
added and marked accordingly.